sleptons: February 2015

Thursday, 26 February 2015

Human neocortex

Before anything else, let me say that I have no deep knowledge of or experience in biology or neuroscience, so most of what I'm going to talk about is based on the papers and work of Jeff Hawkins and some other scientists. I usually like to take ideas from nature and then design software systems based on them, because nature is an already proven example of how a complicated modular system works. So I listen to these people and try to understand as much as I can, then just use the inspiration, nothing else.

OK ... I've always thought that the computer technologies we currently use can't give us enough power to build something as intelligent as a biological creature. We are quite far from building even something like a mosquito. Why? Because I think we chose a not quite right path from the start.

Sparse Matrix Storage

Why do we need to work with sparse matrices or sparse data representations? Look at the following video from the Max Planck Institute for Brain Research. It shows the working neurons in a small part of a mouse cortex. While there are millions of neurons in this part of the cortex, only a small number of them are active at any time, even when the brain is doing something complex. Yes, this is the way the brain processes and works with data: in every single snapshot, you can think of the active neurons as the nonzero bits of information and the rest of the surface as the zero elements of a matrix or memory.

Active neurons in a small part of the mouse cortex
(Max Planck Institute for Brain Research)
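
The snapshot idea described above can be sketched in a few lines of Python. This is only a minimal illustration, not the storage scheme of the original post: the grid size, the coordinates and the names `ROWS`, `COLS`, `active` and `is_active` are all made up for the example.

```python
# Hypothetical snapshot: a 1000 x 1000 patch of "cortex" in which
# only a handful of cells are active at this instant.
ROWS, COLS = 1000, 1000

# Sparse storage: keep only the coordinates of the active (nonzero) cells.
active = {(12, 7), (340, 911), (340, 912), (875, 3), (999, 999)}


def is_active(row, col):
    """Return 1 if the cell is active in this snapshot, otherwise 0."""
    return 1 if (row, col) in active else 0


stored = len(active)      # entries we actually keep
total = ROWS * COLS       # entries a dense matrix would keep
density = stored / total  # fraction of nonzero elements

print(f"stored {stored} of {total} cells, density = {density:.6%}")
```

Every cell that is not in the set is implicitly zero, so the memory used grows with the number of active cells rather than with the size of the surface.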

A dense matrix has mostly nonzero elements, so the best way to store it is like the typical RAM storage we talked about in the previous post. An m × n matrix of integers can be stored in RAM so that each element address (i, j), counted from zero, maps to the single address n × i + j, where n is the number of columns. Remember that the memory structure in a computer is in fact like a matrix: there are banks of memory, and you can think of each bank as something like a row of a matrix too.
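
As a small sketch of this kind of dense storage, assuming zero-based indices and row-by-row (row-major) layout, the element (i, j) of a matrix with n columns lands at flat address i × n + j. The function names `flat_address` and `store_dense` are hypothetical and only serve the illustration.

```python
def flat_address(i, j, n_cols):
    """Map a zero-based (row, column) pair to a single row-major address."""
    return i * n_cols + j


def store_dense(matrix):
    """Copy a list-of-rows matrix into one flat list, as in linear RAM."""
    n_cols = len(matrix[0])
    ram = [0] * (len(matrix) * n_cols)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            ram[flat_address(i, j, n_cols)] = value
    return ram, n_cols


# A 2 x 3 example matrix (m = 2 rows, n = 3 columns).
matrix = [[1, 2, 3],
          [4, 5, 6]]
ram, n_cols = store_dense(matrix)

# Reading element (1, 2) back through its flat address.
print(ram[flat_address(1, 2, n_cols)])
```

Every element, zero or not, occupies one slot, which is exactly why this layout suits dense matrices and wastes space on sparse ones.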

Search in Random Access Memory


Random access memory is simply an array of directly addressable elements such as Bytes, Words, Longs, ... or even Strings. Each memory element is directly addressable by its distance from the first element; this is what we are used to and what we picture as arrays in programming languages. If you are a database programmer, you may simply picture it as a two-column table: the first column is the primary key, or address, of the element and the second column is the content of the element.

So if you have an 8-bit address register, you can address up to 2^8 = 256 elements; with 32 bits you can address up to 2^32 = 4G elements, and with 64 bits up to 2^64 = 16E (Exa) elements. This means that a typical 64-bit processor can address up to 16E Bytes of memory in theory, although in practice they don't (I think most of them are using 48 bits at the moment).
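
The address-space numbers above are easy to check. The following is just a quick sketch of the arithmetic; the helper names `addressable_elements` and `human_readable` are made up for the example, and the units are binary (1K = 1024).

```python
def addressable_elements(bits):
    """Number of distinct addresses an address register of this width can hold."""
    return 2 ** bits


def human_readable(count):
    """Express a count with binary unit prefixes (1K = 1024)."""
    units = ["", "K", "M", "G", "T", "P", "E"]
    index = 0
    while count >= 1024 and index < len(units) - 1:
        count //= 1024
        index += 1
    return f"{count}{units[index]}"


for bits in (8, 32, 48, 64):
    print(bits, "bits ->", human_readable(addressable_elements(bits)))
```

With 48 address bits, the width mentioned above as a guess for current processors, the same calculation gives 256T addressable elements.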

Sparse Distributed Representation

I don't know why, but we have learned to believe that we are somehow always running out of memory or disk space. We have learned that we always have to compress data. For example, to represent 256 simple characters (extended ASCII), all we need is 8 bits. Even this compact representation doesn't satisfy us, and we go on to compress data using mathematical formulas and ....

I will never forget the time I started to study and work on designing databases, tables, fields ... it was all about normalizing and minimizing sizes as much as possible. Why should you keep duplicate information? It is a bad idea to have pre-calculated fields in tables! And so on. But today you can't build applications with tons of data and still insist on a normalized or even a relational data design.

In fact, you may have heard that relational database design is better for programmers and makes programming easier, but if your concern is dealing with a huge amount of data, the relational model doesn't work and you need to go for something else. The same goes even for the way we keep information in memory.

5 active bits in dense and sparse memory.

Bug and instability propagation, working with numbers

Look at the following numbers, which show the number of lines of code in some famous software applications. I got them from informationisbeautiful.net:

Average iPhone app:            50,000 = 50K
Camino Web browser:           200,000 = 200K
Photoshop CS6:              4,500,000 = 4.5M
Google Chrome:              5,000,000 = 5.0M
Firefox:                    9,500,000 = 9.5M
MySQL:                     12,000,000 = 12.0M
Boeing 787:                14,000,000 = 14.0M
Apache OpenOffice:         22,000,000 = 22.0M
Windows 7:                 39,500,000 = 39.5M
Large Hadron Collider:     50,000,000 = 50.0M
Facebook:                  62,000,000 = 62.0M
Mac OS X Tiger:            85,000,000 = 85.0M
Human Genome:           3,300,000,000 = 3.3G (!!!)

I don't know how accurate these numbers are, but even if they are only 50% accurate, there is still something we have to think about:

Chain reaction model for bug propagation

In the two previous posts, we tried to see the effect of a modular topology on bug generation in a development cycle. We saw that if we add some new lines to a project and edit some previously written lines, then the total number of bugs in the modules we updated is:

DDB = ε Σ Ei + v Σ Ni (2)

Σ Ei is the count of all edited lines and Σ Ni is the count of all new lines of code. ε is the ratio of bugs per edited line and v is the ratio of bugs per new line of code. For the neighbors' effect in a full mesh we considered just one level of impact, which was:

IDB = ψ(n-1) (Σ Ei + Σ Ni) (1)
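
To get a feeling for the two formulas, here is a minimal sketch that evaluates them exactly as written in equations (2) and (1). All numbers are hypothetical, the function names `direct_bugs` and `indirect_bugs` are made up, `psi` stands for the neighbor ratio ψ described in the next paragraph, and `n` is assumed to be the number of modules in the full mesh.

```python
def direct_bugs(edited, new, eps, v):
    """Equation (2): DDB = eps * sum(Ei) + v * sum(Ni)."""
    return eps * sum(edited) + v * sum(new)


def indirect_bugs(edited, new, psi, n):
    """Equation (1): IDB = psi * (n - 1) * (sum(Ei) + sum(Ni))."""
    return psi * (n - 1) * (sum(edited) + sum(new))


# Hypothetical development cycle touching two modules.
edited_lines = [100, 50]   # Ei: edited lines per updated module
new_lines = [200, 150]     # Ni: new lines per updated module

ddb = direct_bugs(edited_lines, new_lines, eps=0.02, v=0.015)
idb = indirect_bugs(edited_lines, new_lines, psi=0.001, n=4)

print(f"direct bugs   = {ddb:.2f}")
print(f"indirect bugs = {idb:.2f}")
```

Because of the (n - 1) factor, the indirect part grows with the number of modules in the mesh even when the amount of changed code stays the same.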

ψ is the ratio of bugs per newly added bug in a neighbor. We also saw that, in order to bring down the side effects or indirect effects, we can design a module topology for our application. So for a sample of 4-module fully connected meshes (4Ms) in the first layer, and then another fully connected mesh connecting these 4Ms, we have the indirect bugs as:

The effect of software topology in bug creation

In the last post, we implicitly assumed a fully connected mesh topology for our software modules, while there is almost no application in which every module has a connection to all the others.

A not fully connected mesh topology

For example, in the picture module M2 doesn't have any direct connection to M3. We could assume that effects on M1 caused by M2 can in turn affect M3 (via M1), but since we are building a simple model we ignore such middle-man effects and only accept the effect of direct neighbors for the moment.

We have to find out how we can fix our model to support such a topology. The answer is: WE CAN'T unless we have a model for our software. So let us build a general model for the software.

OK, suppose there is a restriction in the application that any 4 small modules can have a full mesh connection, and the whole application, which contains many such 4M units, can have a fully connected mesh topology between its 4Ms.

A model for software development bug calculation

What I'm going to talk about is just a sample study of why software applications get complex and buggy over time. I'm trying to build a simple model of a software application's bugginess, so we need some definitions and a scenario first. Let's get started.

First, we all know that software applications should evolve because their environment, users, and needs evolve. So we have to modify applications or add features to them to support the evolution of the ecosystem.

There are many metrics for measuring software quality, but for our simple model let's just deal with "bugs per line of code". I've seen an average of, say, 15 bugs per 1000 lines of code in papers. This is just a number to give a feeling for what we are talking about; we are going to use symbols instead of numbers. I just want to emphasize that even advanced programmers write code with bugs. We use two types of symbols, one for new lines of code and one for edited lines of code, because the effects of these two on generating bugs are not the same (we will talk about it later):

Software longevity: Memory & Hard Disk

If you ask me what the main factors are that stop even carefully designed, developed, and tested software from working in the long run, I'd say "Memory & Disk". These two factors are always underestimated by programmers, or get no attention at all. Remember from the previous post that the cause of the problem with NASA's Spirit rover was a full disk.

Why do we always assume we have the required amount of memory and disk space all the time? Why don't our tests show these problems? There are many reasons; let me list some of them here:

1- Wrong kind of tests: We usually don't test the developed software over the long term, or don't keep it under test for weeks or months. This mostly happens when you are new to software testing and don't have enough experience. You get excited when it works for weeks, but this doesn't mean your software has no memory leak or doesn't waste disk space.

2- Don't have enough time: Even if we test the software for months, we usually can't test it for years, while many software systems should work for years without restarting. This one happens when you are under pressure to deliver the software as soon as possible.

3- Fear of failure: We get emotionally involved in software development and in the developed software itself. This has good and bad sides. One of the bad sides is that we always have some fear of failure when we test the software, so we usually test it gently, and as soon as we see some signs of acceptance we finish the test.

4- Not in a real environment: If you think the user dedicates hardware to your software and gives it enough memory and disk space, you are wrong. They may do this at first, but sooner or later they will install other software systems on the machine. Even if we test the software for months and apply every kind of test procedure to it, we usually forget that this is not the real environment the software is going to work in.

Software fails, it is inevitable

If you search the web for big and famous software bugs or failures, you'll find plenty of them, like the following:

The famous Y2K problem we experienced in the year 2000.
Similar to Y2K, we will face a problem in 2038, when the Unix timestamp, a signed 32-bit integer counting seconds, overflows. Why 2038? Just do this simple math: 2,147,483,647/3600/24/365 ~ 68 years, and since the count started in 1970, 1970+68=2038!
The famous Northeast blackout of 2003 was the result of a race condition in monitoring software.
The Mariner 1 crash was caused by a wrong piece of the program. The programmer should have used the average speed in a calculation instead of the instantaneous speed.
NASA's loss of communication with the Mars Polar Lander in December 1999 was caused by a software error.
NASA's Spirit rover became unresponsive after landing on Mars because too many files were stored in its flash memory. The problem was solved just by deleting the unwanted files.
...
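
The 2038 arithmetic from the list above can be checked with a few lines of Python. This is only a sketch of the calculation, assuming a signed 32-bit counter of seconds since 1 January 1970 UTC; the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Largest value a signed 32-bit integer can hold.
MAX_INT32 = 2**31 - 1

# Unix time counts seconds from this moment.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# The last second a signed 32-bit timestamp can represent.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)

# The rough estimate used in the text (ignoring leap years).
years = MAX_INT32 / 3600 / 24 / 365

print("max counter value:", MAX_INT32)
print("roughly", round(years, 1), "years after 1970")
print("last representable moment:", last_moment.isoformat())
```

The exact date falls in January 2038, which agrees with the rough 1970 + 68 estimate.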

Why do such things happen?

We sometimes think it is only us, I mean ordinary programmers in ordinary companies, who make mistakes or write programs with bugs. This is not true, don't worry: everybody in every company can make a mistake, and does. There are hundreds of these costly historical mistakes available on the internet to read and learn from.

sleptons

Thursday, 26 February 2015

Human neocortex

Tuesday, 24 February 2015

Sparse Matrix Storage

Saturday, 21 February 2015

Search in Random Access Memory

Sparse Distributed Representation

Tuesday, 17 February 2015

Bug and instability propagation, working with numbers

Wednesday, 11 February 2015

Chain reaction model for bug propagation

Tuesday, 10 February 2015

The effect of software topology in bug creation

Monday, 9 February 2015

A model for software development bug calculation

Tuesday, 3 February 2015

Software longevity: Memory & Hard Disk

Monday, 2 February 2015

Software fails, it is inevitable