stats: Store vector stats using doubles and compress with zlib

Review Request #1646 - Created Jan. 15, 2013 and updated

Information
Andreas Hansson
gem5
default
Reviewers
Default
Changeset 9499:bc23f2c316fc
---------------------------
stats: Store vector stats using doubles and compress with zlib

This patch changes any arrays of values to be stored as an array of doubles,
rather than floats in the SQL database. This is required as floats lose too much
accuracy. For example, if the stats are read from the database, and injected
back into gem5's stats system, then formulas can be recalculated. If floats are
used, these formulas evaluate to be different from those originally calculated
when creating the SQL database.

As doubles take up twice the space of floats (8 bytes vs 4 bytes), the SQL
database becomes larger. The end result is that, without compression, the
database is larger than the text-based output. Therefore, as the vector storage
is already not human-readable, we compress this field using zlib. zlib has been
in the Python standard library since version 1.5.1, so it is already covered by
the gem5 build prerequisites.
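
To make the trade-off in the description concrete, here is a minimal sketch of packing a vector as doubles and compressing it with zlib. It is not the patch's code: `pack_vector` and `unpack_vector` are hypothetical helper names, the sample values are made up, and the sketch assumes a modern Python 3 rather than the Python version gem5 used at the time.

```python
import array
import zlib


def pack_vector(values, typecode="d"):
    """Pack Python floats into a zlib-compressed blob ('d' = 64-bit, 'f' = 32-bit)."""
    return zlib.compress(array.array(typecode, values).tobytes())


def unpack_vector(blob, typecode="d"):
    """Inverse of pack_vector: decompress and decode back to a list of floats."""
    out = array.array(typecode)
    out.frombytes(zlib.decompress(blob))
    return out.tolist()


# Made-up vector: three real values followed by many NaN placeholders.
values = [0.1, 1.0 / 3.0, 123456.789] + [float("nan")] * 97

as_double = unpack_vector(pack_vector(values, "d"), "d")
as_float = unpack_vector(pack_vector(values, "f"), "f")

# Doubles round-trip bit-for-bit; 32-bit floats come back slightly different,
# so a formula recomputed from them no longer matches the original result.
double_exact = as_double[:3] == values[:3]
float_exact = as_float[:3] == values[:3]

raw_size = 8 * len(values)                    # uncompressed size as doubles
compressed_size = len(pack_vector(values, "d"))
```

In this sketch the repeated NaN bit pattern compresses well, which is why the compressed blob ends up smaller than the raw array of doubles.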

   
Posted (Jan. 25, 2013, 1:58 a.m.)
A double is 8 bytes, and each character in a text-based output is
probably >= 1 byte, depending on the encoding. If the double value
actually holds fewer than 8 characters, I am surprised that a
float value does not suffice. What other info does the database
include that is increasing its size?
  1. The reason for switching from float to double was inaccuracies when formulas were recalculated.
    
    I wrote a script which takes the stats from the SQL database and injects them back into the gem5 Python stats system. This allowed me to generate a text-based stats file and an SQLite database for a gem5 run, then inject the data back into the stats system and regenerate the text-based output. The goal was to ensure that the stats were being stored and retrieved correctly, i.e. that the original stats.txt matched the one generated from the SQLite database. When floats were used to store the data in the database, some of the formulas evaluated to significantly different results, as some of the accuracy was lost when storing. This issue was resolved by changing the storage to double, as Python's "float" is actually 64 bits (on most architectures/Python implementations).
    
    In order to minimise the number of database accesses, vector stats (vector, vector2d and formulas) are stored as binary blobs in the database, thereby storing all elements of the vector in one field. However, this has the side effect that if you have, for example, a vector of length 10 with one actual value and nine NaNs, you still have to store the NaNs. Naturally, if you then double the space needed to store each value (including the NaNs), the database becomes very large.
    
    In my view there are two alternatives to the approach in the patch:
    
    1. Store each element of a vector in a separate table, and "reconstruct" the vector when we want the values. This has two side effects. First of all, each access requires multiple database accesses, or complex joining of tables, which will increase the access time. Secondly, if each element is stored by itself, it also needs to be stored with the ID of the stat it belongs to, the index of the dump it belongs to and its position within the vector. This potentially requires more space than the approach in this patch. That said, it would allow only specific elements of the vector to be pulled from the database.
    
    2. Manually pack the data into the blob field. We could store only the non-NaN data by manually packing it as <index within vector><value as double> pairs. This has the advantage of only storing the data we care about (although we have the additional overhead of storing the index within the vector), and we could pull this data out with one database access. However, we do then have the overhead of packing and unpacking the data, which is potentially very slow.
    
    Personally, I don't think that either of these alternatives is ideal, but I think that the solution in the patch presents a fairly foolproof way of storing the data. Of course, I am more than open to suggestions, but I think it will always be a trade-off between elegance, size, speed and accuracy.
  2. The second approach is what I would personally prefer. It is pretty common to
    store sparse matrices / vectors that way. Note that even compression is 'slow
    and time consuming'. But I'll let you decide the approach you want to take.
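
For reference, the second alternative discussed above (storing only the non-NaN entries as <index within vector><value as double> pairs) could look roughly like the sketch below. The byte layout (a little-endian 32-bit index followed by a 64-bit double) and the names `ENTRY`, `pack_sparse` and `unpack_sparse` are assumptions made for illustration; the thread does not specify a format. The vector length is assumed to be known from elsewhere, for example from the stat's metadata.

```python
import math
import struct

# Hypothetical entry layout: uint32 index + float64 value, little-endian, no padding.
ENTRY = struct.Struct("<Id")


def pack_sparse(values):
    """Pack only the non-NaN elements as (index, value) pairs."""
    return b"".join(
        ENTRY.pack(i, v) for i, v in enumerate(values) if not math.isnan(v)
    )


def unpack_sparse(blob, length):
    """Rebuild the dense vector, filling missing positions with NaN."""
    values = [float("nan")] * length
    for i, v in ENTRY.iter_unpack(blob):
        values[i] = v
    return values


# The example from the thread: length 10, one actual value and nine NaNs.
vector = [float("nan")] * 10
vector[4] = 2.5

blob = pack_sparse(vector)
restored = unpack_sparse(blob, len(vector))

sparse_size = len(blob)          # one 12-byte entry
dense_size = 8 * len(vector)     # ten 8-byte doubles
```

The sparse form only pays off when most elements are NaN: each stored entry costs 12 bytes here instead of 8, so a mostly full vector would grow rather than shrink.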
Posted (Jan. 25, 2013, 8:13 a.m.)
Seems like overkill to me.  If you do this, then you can't do any math using SQL and you have to suck out values to do anything.  If that's the attitude, why even bother using sqlite at all?
  1. You can't do math in SQL, but that probably wasn't what you wanted to do anyway. You probably want to suck the data back into the Python class hierarchy and manipulate it there. I think the ideal situation would be to pickle the objects and not use SQL; however, that was much slower. The slowest (and largest) option was an SQL table with stat, x, y and value columns, which meant reading a large array took forever.
  2. Interesting.  When I was doing tons of sampling, doing the math in SQL was exactly what I wanted to do because I could do queries in moments compared to loading several gigabytes of data and then processing it.  All of the context stuff and the stuff in util/stats/db.py was to do that.  The nice thing about the database is that you can build up a very large database of stats across many experiments that have many samples, and with SQL, you can really quickly query those stats.  If you're just trying to have something be a binary format, you may as well just serialize as json (or msgpack) and gzip the whole file.  I, personally, found the SQL thing to be awesome.  I could regenerate complex graphs in moments.  (Not to mention the fact that SQL actually implements tons of useful operations.)
  3. The binary data stored in SQL is a sensible middle ground at this point as you can avoid the scenario you describe of having to unzip/unserialize the whole file, and can simply get the data you need through queries. Then you will indeed have to unzip/unserialize those bits before you can manipulate them, but the benefit is that the size of the database is manageable.
    
    We tried a range of options and this seemed like a sensible starting point. If someone wants to extend or modify it going forward that is of course very welcome.
  4. Personally, I don't think storing blobs in sqlite is particularly sensible.  It is seriously limiting.  If sqlite is to be the canonical storage format, it seems that it should be simple and obvious.  If you have particular storage/speed issues, then a secondary implementation might make sense (but don't call it sql since you're just using it for storage, not for SQL).  Then again, if you want to store blobs, why are you using sqlite at all? Why not use dbm?
    
    Is the problem space or speed?  If the problem is speed, what operations are you doing?  If you're simply converting to text, then I'd say that's not a useful benchmark.
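
The disagreement in this thread can be illustrated with a small sketch that stores the same samples both ways in SQLite. The schema, table names (`elems`, `blobs`) and the stat name `ipc` are hypothetical and are not taken from the patch or from util/stats/db.py; the data is made up.

```python
import array
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
# Hypothetical per-element layout: one row per vector element.
conn.execute("CREATE TABLE elems (stat TEXT, dump INTEGER, idx INTEGER, value REAL)")
# Hypothetical blob layout: one row per vector, compressed doubles.
conn.execute("CREATE TABLE blobs (stat TEXT, dump INTEGER, data BLOB)")

samples = {0: [1.0, 2.0, 3.0], 1: [3.0, 4.0, 5.0]}  # made-up dumps of one stat
for dump, vec in samples.items():
    conn.executemany(
        "INSERT INTO elems VALUES ('ipc', ?, ?, ?)",
        [(dump, i, v) for i, v in enumerate(vec)],
    )
    conn.execute(
        "INSERT INTO blobs VALUES ('ipc', ?, ?)",
        (dump, zlib.compress(array.array("d", vec).tobytes())),
    )

# Per-element layout: the aggregation runs inside SQL.
sql_avg = conn.execute(
    "SELECT AVG(value) FROM elems WHERE stat = 'ipc' AND idx = 0"
).fetchone()[0]


def decode(blob):
    """Decompress a blob back into an array of doubles."""
    out = array.array("d")
    out.frombytes(zlib.decompress(blob))
    return out


# Blob layout: SQL only selects the rows; the math happens in Python.
firsts = [
    decode(row[0])[0]
    for row in conn.execute("SELECT data FROM blobs WHERE stat = 'ipc'")
]
py_avg = sum(firsts) / len(firsts)
```

Both paths give the same average, but only the per-element table lets SQLite do the aggregation; the blob table needs one row per vector instead of one row per element, which is the space and access-count argument made earlier in the thread.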