Blosc: A blocking, shuffling and loss-less compression library
Transmitting data from memory to CPU (and back) faster than a plain memcpy()
You can access the public releases here.
Git Repository and ticketing
The home of the git repository for Blosc is located at:
You can download the sources and file tickets there too.
There is an official Blosc mailing list at:
What Is It?
Blosc is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations (which are typical in vector-vector operations).
It uses the blocking technique (as described in this article) to reduce activity on the memory bus as much as possible. In short, the blocking technique works by dividing datasets into blocks that are small enough to fit in the L1 cache of a modern processor and performing compression/decompression there. It also leverages the SIMD (SSE2) and multi-threading capabilities of today's multi-core processors to accelerate the compression/decompression process as much as possible.
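The blocking idea can be illustrated with a minimal Python sketch. It is not Blosc's implementation: zlib stands in for the real compressor, and the 32 KiB block size and the function names are assumptions chosen only for illustration.

```python
import zlib
from typing import List

# Assumed block size for illustration; real L1 cache sizes vary by processor.
BLOCK_SIZE = 32 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Split data into fixed-size blocks and compress each one independently."""
    return [
        zlib.compress(data[start:start + block_size], 1)
        for start in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks: List[bytes]) -> bytes:
    """Decompress every block and concatenate the results in order."""
    return b"".join(zlib.decompress(block) for block in blocks)
```

Because every block is compressed on its own, each one can be processed while it is small enough to stay cache-resident, and blocks can later be handed to different threads without any shared state.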
Meta-Compression And Other Advantages Over Existing Compressors
Blosc is not like other compressors: it is better described as a meta-compressor, because it can use different compressors and pre-conditioners (programs that generally improve compression ratio). It can still be called a compressor, though, because it already integrates one compressor and one pre-conditioner and so works as one out of the box.
Currently it uses BloscLZ, a compressor heavily based on FastLZ, and a highly optimized (it can use SSE2 instructions, if available) Shuffle pre-conditioner. However, different compressors or pre-conditioners may be added in the future.
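To give an idea of what a shuffle pre-conditioner does, here is a minimal pure-Python sketch. It is not the SSE2-optimized routine shipped with Blosc; the function names are hypothetical, and zlib is used only as a stand-in compressor to measure the effect.

```python
import struct
import zlib


def shuffle(data: bytes, typesize: int) -> bytes:
    """Group byte 0 of every element together, then byte 1, and so on."""
    if len(data) % typesize:
        raise ValueError("data length must be a multiple of typesize")
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data: bytes, typesize: int) -> bytes:
    """Invert shuffle(): interleave the byte planes back into elements."""
    if len(data) % typesize:
        raise ValueError("data length must be a multiple of typesize")
    count = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * count:(i + 1) * count]
    return bytes(out)


if __name__ == "__main__":
    # 10000 consecutive 32-bit integers: the high-order bytes rarely change.
    values = struct.pack("<10000I", *range(10000))
    print("plain   :", len(zlib.compress(values)))
    print("shuffled:", len(zlib.compress(shuffle(values, 4))))
```

For slowly varying numeric data, the shuffled buffer places the mostly constant high-order bytes next to each other, which gives the compressor long runs to work with.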
Blosc is in charge of coordinating the compressor and pre-conditioners so that they can leverage the blocking technique (described above) as well as multi-threaded execution (if several cores are available) automatically. As a result, every compressor and pre-conditioner works at very high speed, even if it was not initially designed for blocking or multi-threading.
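The coordination role described here might be sketched as follows. This is an illustration under stated assumptions, not Blosc's scheduler: the function names, the default block size and thread count are hypothetical, and zlib plays the part of the pluggable compressor.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

Transform = Callable[[bytes], bytes]


def identity(block: bytes) -> bytes:
    """Default pre-conditioner: leave the block untouched."""
    return block


def compress_parallel(data: bytes,
                      codec: Transform = zlib.compress,
                      precondition: Transform = identity,
                      block_size: int = 32 * 1024,
                      threads: int = 4) -> List[bytes]:
    """Split data into blocks, then pre-condition and compress them in a thread pool."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in input order, so block order is preserved.
        return list(pool.map(lambda b: codec(precondition(b)), blocks))


def decompress_parallel(blocks: List[bytes],
                        codec: Transform = zlib.decompress,
                        restore: Transform = identity,
                        threads: int = 4) -> bytes:
    """Decompress and un-condition the blocks in a thread pool, then join them."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return b"".join(pool.map(lambda b: restore(codec(b)), blocks))
```

The codec and the pre-conditioner are plain functions that know nothing about blocks or threads; the coordinator supplies both, which is the sense in which a compressor not designed for blocking or multi-threading can still benefit from them.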
Other advantages of Blosc are:
- Meant for binary data: can take advantage of the type size meta-information for improved compression ratio (using the integrated shuffle pre-conditioner).
- Small overhead on non-compressible data: at most 16 additional bytes over the source buffer length are needed to compress any input.
- Maximum destination length: unlike many other compressors, both the compression and decompression routines accept a maximum length for the destination buffer.
- Replacement for memcpy(): it supports a 0 compression level that does not compress at all and only adds 16 bytes of overhead. In this mode, Blosc can usually copy memory faster than a plain memcpy().
When taken together, all these features set Blosc apart from other similar solutions.
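The bounded-overhead, maximum-destination-length and level-0 behaviours listed above can be sketched in a few lines of Python. The 16-byte header layout below is hypothetical and is not Blosc's actual format; the names `pack` and `unpack` are assumptions, and zlib again stands in for the compressor.

```python
import struct
import zlib

# Hypothetical 16-byte header: magic, flag, original size, payload size.
HEADER = struct.Struct("<4sIII")
MAGIC = b"DEMO"
FLAG_STORED = 0
FLAG_COMPRESSED = 1


def pack(data: bytes, level: int, max_dest: int) -> bytes:
    """Frame data; fall back to a plain copy when compression does not help."""
    payload, flag = data, FLAG_STORED
    if level > 0:
        candidate = zlib.compress(data, level)
        if len(candidate) < len(data):
            payload, flag = candidate, FLAG_COMPRESSED
    framed = HEADER.pack(MAGIC, flag, len(data), len(payload)) + payload
    if len(framed) > max_dest:
        raise ValueError("destination buffer too small")
    return framed


def unpack(framed: bytes, max_dest: int) -> bytes:
    """Recover the original data, refusing to exceed max_dest bytes."""
    magic, flag, nbytes, cbytes = HEADER.unpack_from(framed)
    if magic != MAGIC:
        raise ValueError("not a framed buffer")
    if nbytes > max_dest:
        raise ValueError("destination buffer too small")
    payload = framed[HEADER.size:HEADER.size + cbytes]
    return zlib.decompress(payload) if flag == FLAG_COMPRESSED else payload
```

Since the raw bytes are stored whenever the compressed candidate is not smaller, the framed output in this sketch never exceeds the input length plus the 16-byte header, and level 0 reduces to a header followed by a straight copy.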
Where Can Blosc Be Used?
Blosc was initially developed for the needs of the PyTables database and the carray project, although it may be used elsewhere and, in fact, a new ecosystem is slowly appearing around it (Bloscpack, Blaze). It should allow I/O performance to go beyond expected limits and also make it possible to perform arithmetic (for example, see http://pytables.org/moin/ComputingKernel), indexing and query operations on extremely large datasets well beyond the speed of more traditional approaches.
Is It Ready For Production Use?
Yup. It is.
The PyTables community has contributed by testing Blosc very hard, and I'm happy to say that, from 0.9.5 on, it has worked flawlessly compressing and decompressing hundreds of terabytes (coming from the hardsuite and extremesuite in the included benchmark) on many different Windows and Unix boxes, both 32-bit and 64-bit. Of course, that does not mean that Blosc contains no bugs, just that grave bugs are unlikely.
Moreover, with the introduction of Blosc 1.0, it has been declared stable, and both the API and the format have been frozen, so you should expect a large degree of stability for your Blosc-powered apps.
You can find a Python package that wraps Blosc at:
Command line interface
Bloscpack is a nice command line tool that uses Blosc for compressing existing binary datasets on-disk:
Although the format is still unstable, Bloscpack is making great strides toward becoming a great utility. Try it out!
Want To Contribute?
Your cooperation is very important to make Blosc as solid as possible, so if you detect a bug or want to propose an enhancement, feel free to open a new ticket. You can also contribute to this project by simply compiling and running the different benchmark and test suites on your platform, as explained in the SyntheticBenchmarks page.
Blosc is free software and released under the terms of the very permissive MIT license, so you can use it in almost any way you want!
About This Site
This is a place where you can have a look at the Blosc sources, download patches, view existing (open or already closed) tickets and file new tickets. Its goal is to simplify effective tracking and handling of Blosc issues, enhancements and overall progress.
Important note: To prevent spam, you must meet the following requirement before modifying things on this site:
- You need to enable cookies to access some parts of this site; otherwise, you may trigger spam protection and get a "Payment Required" page. Sorry for the inconvenience.
Like all wiki pages, this page is editable: you can modify its contents simply by using your web browser. Just click on the "Edit this page" link at the bottom of the page. WikiFormatting gives a detailed description of the available wiki formatting commands.