Tuesday, 24 February 2015

Sparse Matrix Storage

Why do we need to work with sparse matrices or sparse data representations? Look at the following video from the Max Planck Institute for Brain Research. It shows the firing neurons in a small part of a mouse cortex. While there are millions of neurons in this part of the cortex, only a small fraction of them is active at any moment, even when the brain is doing something complex. This is how the brain processes and works with data: you can think of the active neurons as the bits of nonzero information, and the rest of the surface as the zero elements of a matrix or memory in every single snapshot.


Active neurons in a small part of the mouse cortex
(Max Planck Institute for Brain Research)

A dense matrix has mostly nonzero elements, so it is best to store it like the typical RAM storage we talked about in the previous post: an m × n matrix of integers can be laid out in RAM row by row, so that each element address (i, j) maps to the single flat address i × n + j (using 0-based indices). Remember that the memory structure in a computer is in fact already matrix-like: there are banks of memory, and you can think of each bank as something like a row of a matrix.
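As a minimal sketch of this row-major mapping (the function names here are just for illustration, and indices are 0-based):

def flat_index(i, j, n):
    # each of the i earlier rows occupies n cells, then j more in this row
    return i * n + j

def dense_get(storage, i, j, n):
    # read element (i, j) of an m x n matrix kept as one flat list
    return storage[flat_index(i, j, n)]

# a 2 x 3 matrix stored row by row in one flat list
storage = [10, 20, 30,
           40, 50, 60]
print(dense_get(storage, 1, 2, 3))   # prints 60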


But for a sparse matrix, especially when the matrix gets huge like our brain's memory, it is not efficient to store it like an ordinary dense matrix, i.e. as a plain array. The difference can be dramatic: if, say, 98% of the elements are zero, only a small fraction of a 1 TB dense matrix actually needs to be stored, with the exact saving depending on how much index overhead the chosen format adds!

There are many established ways to store a sparse matrix, and at the core of all of them you see the same idea: "just store the nonzero elements". But some of them give you faster storage, some give you faster data retrieval, and so on.

Dictionary
This is in fact just a list of pairs like <key, value>: the key is the address and the value is the element content. If you can't find an address in the dictionary, you can assume it is zero or empty. An ordinary list with no other operations than insert, delete and read gives you a fast way to keep the list (or memory) updated, but searching is a problem here.

Note that you may think you can implement it as a hash map, and yes you can, but consider first that hash maps use hashing algorithms to index or search for a match, and the size of the map you are going to work with is an issue. Do you think such a map scales to something the size of human memory?
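Here is a minimal Python sketch of the dictionary idea, keyed by the (row, column) pair; the class and its methods are made up for illustration, not any particular library's API:

class DOKMatrix:
    """Dictionary-of-keys sparse matrix: only nonzero cells are stored."""
    def __init__(self, shape):
        self.shape = shape
        self.data = {}          # maps (row, col) -> value

    def set(self, i, j, value):
        if value == 0:
            self.data.pop((i, j), None)   # storing a zero just removes the entry
        else:
            self.data[(i, j)] = value

    def get(self, i, j):
        # a missing key is an implicit zero
        return self.data.get((i, j), 0)

m = DOKMatrix((5, 7))
m.set(0, 0, 1)
m.set(0, 5, 7)
print(m.get(0, 5), m.get(3, 3))   # 7 0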

List of Lists
Another variation of the dictionary for keeping nonzero elements is the list of lists. This means having one list of <column, value> pairs for each entry in a row list. This may give you faster access because you don't need to decode a flat virtual address back into a matrix address: you already have the row number, so you can jump straight to the row, then just find the column in its list and read the corresponding element.
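A rough Python sketch of the list-of-lists idea, again with illustrative names only:

class LILMatrix:
    """List-of-lists sparse matrix: one list of (column, value) pairs per row."""
    def __init__(self, n_rows):
        self.rows = [[] for _ in range(n_rows)]

    def set(self, i, j, value):
        row = self.rows[i]
        for k, (col, _) in enumerate(row):
            if col == j:
                row[k] = (j, value)   # overwrite an existing entry
                return
        row.append((j, value))

    def get(self, i, j):
        # the row is found directly by index; only its own pairs are scanned
        for col, value in self.rows[i]:
            if col == j:
                return value
        return 0

m = LILMatrix(5)
m.set(4, 1, 5)
m.set(4, 4, 3)
print(m.get(4, 4), m.get(2, 2))   # 3 0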

Three Separate Arrays
For an m × n matrix, you may use three separate arrays to keep the nonzero element info: one for the row indices, one for the column indices and one for the values. This is a fast method for inserting, and for looping through the elements in the order they were inserted.
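This three-array layout is usually called coordinate (COO) storage. A tiny Python sketch of inserting and looping, with made-up element positions, could look like this:

# three parallel arrays, one entry per nonzero element
rows, cols, vals = [], [], []

def insert(i, j, value):
    # appending is O(1); the arrays stay in insertion order
    rows.append(i)
    cols.append(j)
    vals.append(value)

insert(0, 0, 1)
insert(0, 5, 7)
insert(1, 4, 4)

# looping over the nonzero elements is just a walk over the three arrays
for i, j, v in zip(rows, cols, vals):
    print(i, j, v)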

Compressed Row Storage (CRS)
Let us describe this one with an example; consider the following matrix:

[
  [1, 0, 0, 0, 0, 7, 0],
  [0, 0, 0, 0, 4, 0, 0],
  [0, 0, 0, 0, 0, 0, 9],
  [0, 0, 3, 0, 0, 0, 0],
  [0, 5, 0, 0, 3, 0, 0]
]

To store this, we simply store all the values in one array and their corresponding columns in another. To identify the rows, we keep, for each row, the index in the "vals" array at which that row starts. If a row is entirely zero, you repeat the same "vals" index so that the row pointer still advances by one entry per row.

vals = [1, 7, 4, 9, 3, 5, 3]
cols = [1, 6, 5, 7, 3, 2, 5]
rows = [1, 3, 4, 5, 6]
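A small Python sketch of working with this layout follows. It uses 0-based indices and, as is conventional for CRS, one extra entry at the end of the row-pointer array so the extent of the last row is known, so the arrays differ slightly from the 1-indexed ones above:

# the same matrix in CRS form, 0-indexed
vals    = [1, 7, 4, 9, 3, 5, 3]
cols    = [0, 5, 4, 6, 2, 1, 4]
row_ptr = [0, 2, 3, 4, 5, 7]      # row i occupies vals[row_ptr[i]:row_ptr[i+1]]

def crs_get(i, j):
    # scan only the nonzeros of row i
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if cols[k] == j:
            return vals[k]
    return 0

def crs_matvec(x):
    # y = A * x, touching only the nonzero entries
    y = [0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[cols[k]]
    return y

print(crs_get(0, 5))                      # 7
print(crs_matvec([1, 1, 1, 1, 1, 1, 1]))  # row sums: [8, 4, 9, 3, 8]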

Compressed Column Storage (CCS)
Same as CRS, but this time we store all the row indices explicitly and keep as little data as possible for the columns.
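To make the symmetry concrete, here is the same example matrix in CCS form, using the same 0-based convention as the CRS sketch above: the values are listed column by column, their row indices are kept explicitly, and a column-pointer array marks where each column starts (the repeated 3 shows how the empty fourth column is handled):

vals    = [1, 5, 3, 4, 3, 7, 9]
rows    = [0, 4, 3, 1, 4, 0, 2]
col_ptr = [0, 1, 2, 3, 3, 5, 6, 7]   # column j occupies vals[col_ptr[j]:col_ptr[j+1]]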

Choosing the right storage type depends on the distribution of the nonzero elements and on the size of the matrix. For example, if the number of rows is much bigger than the number of columns, CRS may be the better choice.