Version 24 (modified by faltet, 3 years ago)
Blosc: A blocking, shuffling and loss-less compression library
Transmitting data from memory to CPU (and back) faster than a plain memcpy()
What Is It?
Blosc is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations (which are typical of vector-vector operations).
It uses the blocking technique (as described in this article) to reduce activity on the memory bus as much as possible. In short, the blocking technique works by dividing datasets into blocks that are small enough to fit in the L1 cache of modern processors and performing compression/decompression there. It also leverages the SIMD (SSE2) and multi-threading capabilities of today's multi-core processors to accelerate the compression/decompression process as much as possible.
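The blocking idea can be illustrated with a minimal Python sketch. This is not how Blosc is implemented: zlib stands in for the real compressor, the 32 KiB block size is an assumed value picked to resemble a typical L1 data cache, and the function names are hypothetical. It only shows that each block is compressed and decompressed independently of the others.

```python
import zlib

# Assumed block size (not taken from Blosc): small enough to fit a typical L1 data cache.
BLOCK_SIZE = 32 * 1024


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Split `data` into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[start:start + block_size], 1)
        for start in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks):
    """Decompress each block on its own and concatenate the results."""
    return b"".join(zlib.decompress(block) for block in blocks)


if __name__ == "__main__":
    payload = bytes(range(256)) * 1024  # 256 KiB of repetitive sample data
    blocks = compress_blocks(payload)
    assert decompress_blocks(blocks) == payload
    print(len(blocks), "blocks,", sum(len(b) for b in blocks), "compressed bytes")
```

Because no block depends on its neighbours, the working set of each step stays small, and the blocks can later be handed to separate threads.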
Meta-Compression And Other Advantages Over Other Compressors
Blosc is not like other compressors: it is better described as a meta-compressor, because it can use different compressors and pre-conditioners (programs that generally improve the compression ratio). At present it supports one compressor and one pre-conditioner.
Currently it uses BloscLZ, a compressor heavily based on FastLZ, and a highly optimized (it can use SSE2 instructions, if available) Shuffle pre-conditioner. However, different compressors or pre-conditioners may be added in the future.
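To give an idea of what a shuffle pre-conditioner does, here is a small pure-Python sketch (Blosc's own version is optimized C, optionally using SSE2, and is not reproduced here). Shuffling regroups the bytes of a buffer so that the first byte of every element comes first, then the second byte of every element, and so on. The function names are hypothetical, and zlib is used only as a stand-in compressor for the comparison.

```python
import struct
import zlib


def shuffle(data, typesize):
    """Group the i-th byte of every element together (a byte transposition)."""
    if len(data) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    # data[i::typesize] selects byte number i of each element.
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data, typesize):
    """Undo shuffle(): put every byte back into its original element."""
    if len(data) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    nelems = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * nelems:(i + 1) * nelems]
    return bytes(out)


if __name__ == "__main__":
    # 25000 consecutive 4-byte little-endian integers.
    raw = struct.pack("<25000i", *range(25000))
    shuffled = shuffle(raw, 4)
    assert unshuffle(shuffled, 4) == raw
    print("plain:   ", len(zlib.compress(raw)), "bytes")
    print("shuffled:", len(zlib.compress(shuffled)), "bytes")
```

For data such as slowly varying integers, the high-order bytes of neighbouring elements are nearly identical, so after shuffling they form long runs that a compressor handles well.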
Blosc is in charge of coordinating the compressor and pre-conditioners so that they automatically run via the blocking technique (described above) and use multi-threading (if several cores are available). As a result, every compressor and pre-conditioner can work at very high speeds, even if it was not initially designed for blocking or multi-threading.
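The coordinating role can be sketched roughly as follows. The MetaCompressor class, its parameters and their defaults are all hypothetical and do not mirror the Blosc API; zlib and bz2 merely play the part of pluggable compressors that know nothing about blocking or threads.

```python
import bz2
import zlib
from concurrent.futures import ThreadPoolExecutor


def identity(block):
    """Default pre-conditioner: leave the block untouched."""
    return block


class MetaCompressor:
    """Hypothetical coordinator: it owns blocking and threading and delegates
    the real work to whatever callables it is given."""

    def __init__(self, compress, decompress, precondition=identity,
                 restore=identity, block_size=32 * 1024, nthreads=4):
        self._compress = compress
        self._decompress = decompress
        self._precondition = precondition
        self._restore = restore
        self.block_size = block_size
        self.nthreads = nthreads

    def compress(self, data):
        blocks = [data[i:i + self.block_size]
                  for i in range(0, len(data), self.block_size)]
        with ThreadPoolExecutor(max_workers=self.nthreads) as pool:
            # pool.map returns results in input order, so block order is kept.
            return list(pool.map(
                lambda b: self._compress(self._precondition(b)), blocks))

    def decompress(self, blocks):
        with ThreadPoolExecutor(max_workers=self.nthreads) as pool:
            return b"".join(pool.map(
                lambda b: self._restore(self._decompress(b)), blocks))


if __name__ == "__main__":
    data = bytes(range(256)) * 1000
    # The same coordinator drives two unrelated compressors.
    for comp, decomp in ((zlib.compress, zlib.decompress),
                         (bz2.compress, bz2.decompress)):
        mc = MetaCompressor(comp, decomp)
        assert mc.decompress(mc.compress(data)) == data
```

The compressors are passed in as plain callables, so swapping one for another does not touch the blocking or threading logic. How much the threads actually overlap depends on the interpreter and on the plugged-in compressor.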
Other advantages of Blosc are:
- Meant for binary data: can take advantage of the type size meta-information for improved compression ratio (using the integrated shuffle pre-conditioner).
- Small overhead on non-compressible data: at most 16 additional bytes over the source buffer length are needed to compress any input.
- Maximum destination length: unlike many other compressors, both the compression and decompression routines accept a maximum length for the destination buffer. So, if the buffer does not have enough capacity to hold the output of the compress / decompress routines, they will return without any further side effects.
All these features set Blosc apart from other similar solutions.
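The last two points in the list above can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the 16-byte header layout is invented and is not Blosc's real format, the function names are hypothetical, zlib stands in for the compressor, and the functions return new objects instead of writing into a caller-supplied buffer. It shows a raw-copy fallback that caps the overhead at the header size, and a capacity check that refuses to produce output larger than the destination.

```python
import struct
import zlib

# Hypothetical 16-byte header (NOT Blosc's real layout):
# version, flags, typesize, unused, source length, payload length, reserved.
HEADER = struct.Struct("<BBBBIII")
FLAG_RAW = 0x1  # payload is an uncompressed copy of the source


def compress_bounded(src, max_dest, typesize=1):
    """Return header + payload, or None if it would not fit in max_dest bytes."""
    payload, flags = zlib.compress(src, 1), 0
    if len(payload) >= len(src):
        # Incompressible input: store it as-is, so the overhead is the header only.
        payload, flags = src, FLAG_RAW
    if HEADER.size + len(payload) > max_dest:
        return None  # destination too small: nothing is produced
    return HEADER.pack(1, flags, typesize, 0, len(src), len(payload), 0) + payload


def decompress_bounded(buf, max_dest):
    """Return the original bytes, or None if they would not fit in max_dest bytes."""
    _version, flags, _typesize, _, nbytes, cbytes, _ = HEADER.unpack_from(buf)
    if nbytes > max_dest:
        return None  # checked before any decompression work is done
    payload = buf[HEADER.size:HEADER.size + cbytes]
    return payload if flags & FLAG_RAW else zlib.decompress(payload)


if __name__ == "__main__":
    tiny = b"abc"  # too short to compress, so it is stored raw
    packed = compress_bounded(tiny, max_dest=len(tiny) + HEADER.size)
    assert len(packed) == len(tiny) + 16
    assert compress_bounded(tiny, max_dest=len(tiny) + 15) is None
    assert decompress_bounded(packed, max_dest=len(tiny)) == tiny
```

Reading the source length from the header before decompressing is what lets the routine decline early, without touching the destination.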
Where Can Blosc Be Used?
Blosc is being developed mainly for the needs of the PyTables database, although it may be used elsewhere. It is expected to allow PyTables to perform arithmetic (for example, see http://pytables.org/moin/ComputingKernel), indexing and query operations on extremely large datasets at speeds well beyond those of more traditional approaches (like memmap'ed access to files, or uncompressed on-disk indexes).
Is It Ready For Production Use?
No, not yet, so please be careful when using it. That said, the format has been frozen since version 0.8, so it is at least guaranteed not to change for a long while. The API has been frozen too, as of release 0.9.5. The only part that remains is testing Blosc extensively and broadly.
Part of the PyTables community is currently testing Blosc very hard, and I'm happy to say that, from 0.9.5 on, it has worked flawlessly compressing and decompressing hundreds of terabytes on many different Windows and Unix boxes, both 32-bit and 64-bit. When this testing process ends (very soon now), that will mark the beginning of the 1.x series.
Want To Contribute?
Your cooperation is very important for making Blosc stable as soon as possible, so if you find a bug or want to propose an enhancement, feel free to open a new ticket. You can also contribute to this project simply by compiling and running the different benchmark and test suites, as explained in the SyntheticBenchmarks page.
Blosc is free software and released under the terms of the very permissive MIT license, so you can use it in almost any way you want!
The root of the subversion repository for Blosc is located at:
There is no source tarball as such yet. I'll provide one once Blosc is declared stable.
About This Site
This is a place where you can have a look at the Blosc sources, download patches, view existing (open or already closed) tickets and file new tickets. Its goal is to simplify effective tracking and handling of Blosc issues, enhancements and overall progress.
Important note: to prevent spam, you must meet the following requirement before modifying anything on this site:
- You need to enable cookies to access some parts of this site; otherwise, you may trigger spam protection and get a "Payment Required" page. Sorry for the inconvenience.
Like all Wiki pages, this page is editable, which means that you can modify its contents simply by using your web browser. Just click on the "Edit this page" link at the bottom of the page. WikiFormatting will give you a detailed description of the available Wiki formatting commands.