Tuesday, 2 December 2014

Using negative feedback to stabilize software - 1

What does stabilizing software mean? We talked about stability patterns in software ecosystems before and saw that, whether we integrate several binaries into one system (an ecosystem) or build a complex modular application, we need to take care of how the different binaries or modules work together. Wrong behavior in one binary or module may lead to a disaster in the whole system. In fact, most of the time we need some negative feedback loop to prevent faults from propagating and to keep the ecosystem from drifting into an unknown state. The loop can be part of the binaries themselves, or it can be an external binary that talks to the other binaries.

As a reminder, we need feedback loops in complex systems where we can't know exactly what each component is doing. Instead of tracking every component's activity precisely, we check and control the system's outputs and adjust the components' behavior to keep the system up and running.
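To make the idea a bit more concrete, here is a minimal sketch of one step of a negative feedback loop in Python. It is only an illustration: the function name, the target fill level and the gain are all assumptions of mine, not part of any real system. The measured output is how full a queue is, and the adjusted behavior is how fast a reader is allowed to go.

```python
def adjust_read_rate(current_rate, queue_fill, target_fill=0.5,
                     gain=100.0, min_rate=0.0, max_rate=1000.0):
    """One step of a hypothetical proportional feedback loop.

    queue_fill is the measured output (0.0 = empty, 1.0 = full).
    The correction opposes the deviation from the target, which is
    what makes the feedback negative.
    """
    error = target_fill - queue_fill        # negative when the queue is too full
    new_rate = current_rate + gain * error  # slow down if too full, speed up if too empty
    return max(min_rate, min(max_rate, new_rate))  # keep the rate in a sane range
```

Called repeatedly, a step like this lowers the read rate while the queue is fuller than the target and raises it while the queue is emptier, without knowing anything about what the components do internally.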



[Figure: a simple software ecosystem with 3 components]
Process dispatching example
Consider a simple software ecosystem in which one process reads text files and stores each line in an external multi-access queue, and another process does some calculation on each line. That's it: an ecosystem with just 3 components. These components can all live in a single binary as threads, or they can run as separate binaries working together over a network.
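The single-binary, multithreaded variant could look roughly like the sketch below. Everything in it is an assumption made for illustration: files are modelled as lists of strings, the calculation is a simple word count, and a None sentinel marks the end of the input.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-input marker

def reader(files, line_queue):
    """Reader component: pushes every line of every file into the queue."""
    for lines in files:              # each "file" is modelled as a list of strings
        for line in lines:
            line_queue.put(line)
    line_queue.put(SENTINEL)         # tell the calculator there is no more input

def calculator(line_queue, results):
    """Calculation component: consumes lines until the sentinel arrives."""
    while True:
        line = line_queue.get()
        if line is SENTINEL:
            break
        results.append(len(line.split()))  # stand-in calculation: word count

def run_pipeline(files):
    line_queue = queue.Queue()       # the queue component, unbounded in this naive design
    results = []
    t_read = threading.Thread(target=reader, args=(files, line_queue))
    t_calc = threading.Thread(target=calculator, args=(line_queue, results))
    t_read.start()
    t_calc.start()
    t_read.join()
    t_calc.join()
    return results
```

Note that nothing here limits how fast the reader fills the queue, which is exactly the weakness discussed next.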

This integration works and stays stable as long as the throughput of its components is well above the rate of incoming data. For example, if we have just one text file per second and a maximum of 100 lines per text file, we have at most 100 lines per second, and if each of the 3 components can handle, say, 1000 lines per second, there is no problem at all; the system works well. But what if the number of text files per second or the size of the files grows, or the components have different performance? We can't keep the system stable with this simple design.
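The arithmetic behind this can be written down as a tiny helper. This is only a back-of-the-envelope model with made-up parameter names: it assumes constant rates and an unbounded queue, and simply multiplies the surplus of incoming lines over capacity by the elapsed time.

```python
def backlog_after(seconds, files_per_second, lines_per_file,
                  capacity_lines_per_second):
    """Lines left waiting after `seconds`, assuming constant rates."""
    arrival = files_per_second * lines_per_file            # lines arriving per second
    growth = max(0, arrival - capacity_lines_per_second)   # surplus that cannot be handled
    return growth * seconds

# 1 file/s of 100 lines against 1000 lines/s of capacity: nothing piles up.
print(backlog_after(60, 1, 100, 1000))    # 0
# 20 files/s of 100 lines against the same capacity: 1000 extra lines every second.
print(backlog_after(60, 20, 100, 1000))   # 60000
```

As soon as the arrival rate exceeds the slowest component's capacity, the backlog grows without bound in this model.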

Consider this scenario: we don't know how many files per second we have to read, and we don't know the size of the files. The queue's performance depends on the number of empty cells it has, and the calculation time depends on the data stored in each line; it may take milliseconds, seconds or even minutes. So what will happen? A disaster?

Disaster will happen when the customer uses the system
Just a title, but a true one. The customer starts using the system and feeds it as many files as he or she has. The queue fills up, and we have no mechanism to control the read process. Many exceptions may be raised in both the queue and the read process, and if we don't show warnings to the user, he or she thinks the system is working well. The calculation process, on the other hand, doesn't have the processing power to work through a full queue of lines in a second, which also keeps the queue full. If we don't show the user any warning or indicator such as queue size or workload, the only way he or she can find out that something is wrong is that the results of the work are not satisfying.
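An indicator like the queue size is cheap to expose. The sketch below assumes the queue is a bounded Python queue.Queue; the function name and the 80% warning threshold are arbitrary choices for illustration.

```python
import queue

def queue_status(line_queue, warn_ratio=0.8):
    """Return a small status report the UI could show to the user."""
    capacity = line_queue.maxsize
    size = line_queue.qsize()        # approximate when other threads are active
    fill = size / capacity if capacity > 0 else 0.0  # maxsize 0 means unbounded
    return {
        "size": size,
        "capacity": capacity,
        "fill": fill,
        "warning": fill >= warn_ratio,   # raise a flag before the queue is full
    }
```

Showing even this much would let the user notice a growing backlog long before the results look wrong.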

We could fix the system just by watching the size of the LQ (line queue) and stopping the file reads whenever the LQ is full, but that doesn't fulfill the users' requirement: they want their files processed as fast as possible. We need to redesign the whole system so that it works better and faster and stays stable under any load.
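A small discrete-time simulation shows why the "stop reading when the LQ is full" fix is stable but not fast. It is a toy model under assumed numbers: each tick the reader offers a fixed number of lines, only as many as fit are accepted, and the calculator removes a fixed number.

```python
def simulate(ticks, read_rate, calc_rate, lq_capacity):
    """Toy model of a reader that pauses whenever the LQ is full."""
    lq = 0          # lines currently waiting in the line queue
    processed = 0   # lines the calculator has finished
    stalled = 0     # lines the reader wanted to push but had to hold back
    for _ in range(ticks):
        accepted = min(read_rate, lq_capacity - lq)  # never overfill the LQ
        stalled += read_rate - accepted
        lq += accepted
        done = min(calc_rate, lq)                    # calculator works at its own pace
        lq -= done
        processed += done
    return {"lq": lq, "processed": processed, "stalled": stalled}

# Reader offers 100 lines per tick, calculator handles 30, LQ holds 200.
print(simulate(10, 100, 30, 200))
# {'lq': 170, 'processed': 300, 'stalled': 530}
```

The queue never overflows, so the system is stable, but more than half of the offered lines are still waiting outside: throughput is capped by the calculator, which is why the users would not be happy with this fix alone.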
