Blosc: A blocking, shuffling and loss-less compression library

Transmitting data from memory to CPU (and back) faster than a plain memcpy(). Wanna fly?

Git repository, downloads and ticketing

The Blosc git repository is hosted at:

 https://github.com/Blosc

You can download the sources and file tickets there too.

Mailing list

There is an official Blosc mailing list at:

 http://groups.google.com/group/blosc

What Is It?

Blosc is a high-performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations (which are typical of vector-vector operations).

It uses the blocking technique (as described in this article) to reduce activity on the memory bus as much as possible. In short, the blocking technique works by dividing datasets into blocks that are small enough to fit in the L1 cache of modern processors and performing compression/decompression there. It also leverages the SIMD (SSE2) and multi-threading capabilities of today's multi-core processors to accelerate the compression/decompression process as much as possible.
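The blocking idea can be sketched in a few lines of Python. This is only an illustration, not Blosc's implementation: zlib stands in for the real compressor, the 32 KiB block size is an assumed value chosen to fit a typical L1 data cache, and compress_blocks() and decompress_blocks() are hypothetical helper names.

```python
import zlib

# Assumed block size (not taken from Blosc): small enough for a typical L1 cache.
BLOCK_SIZE = 32 * 1024


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each fixed-size block of `data` independently."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def decompress_blocks(blocks):
    """Decompress the blocks one by one and join them back together."""
    return b"".join(zlib.decompress(block) for block in blocks)


if __name__ == "__main__":
    payload = bytes(range(256)) * 1024  # 256 KiB of sample data
    blocks = compress_blocks(payload)
    assert decompress_blocks(blocks) == payload
    print(len(blocks), "blocks")
```

Because every block is compressed on its own, each one can be processed while it is still resident in cache, and blocks can be handed to different threads.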

You can find more information about Blosc in the last part of this presentation. Some benchmarks are available in SyntheticBenchmarks.

Meta-Compression And Other Advantages Over Existing Compressors

Blosc is not like other compressors: it should rather be called a meta-compressor, because it can use different compressors and pre-conditioners (programs that generally improve the compression ratio). It can still be called a compressor, though, because it already integrates at least one compressor and one pre-conditioner and so can work as one out of the box.

By default, Blosc currently uses blosclz, a compressor heavily based on FastLZ. From version 1.3 on, Blosc also includes support for lz4 and lz4hc, snappy and zlib. It also comes with a highly optimized shuffle pre-conditioner (which can use SSE2 instructions, if available). Other compressors or pre-conditioners may be added in the future.
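To give a feel for what a shuffle pre-conditioner does, here is a minimal pure-Python sketch. It is not Blosc's optimized SSE2 code: shuffle() and unshuffle() are hypothetical names, and zlib is used only to compare sizes. The idea is to regroup the bytes of fixed-size elements so that the first bytes of all elements come first, then the second bytes, and so on, which typically yields longer runs of similar bytes for the compressor.

```python
import struct
import zlib


def shuffle(data, typesize):
    """Group the i-th byte of every element together (a byte transposition)."""
    if len(data) % typesize:
        raise ValueError("data length must be a multiple of typesize")
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data, typesize):
    """Invert shuffle(): interleave the byte planes back into elements."""
    if len(data) % typesize:
        raise ValueError("data length must be a multiple of typesize")
    count = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * count:(i + 1) * count]
    return bytes(out)


if __name__ == "__main__":
    # 1000 little-endian 64-bit integers: most high-order bytes are zero.
    values = struct.pack("<1000q", *range(1000))
    shuffled = shuffle(values, 8)
    assert unshuffle(shuffled, 8) == values
    print("plain:   ", len(zlib.compress(values)), "bytes")
    print("shuffled:", len(zlib.compress(shuffled)), "bytes")
```

The type size is the only piece of meta-information the transformation needs, and it is fully reversible.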

Blosc is in charge of coordinating the compressor and pre-conditioners so that they can leverage the blocking technique (described above) as well as multi-threaded execution (if several cores are available) automatically. As a result, every compressor and pre-conditioner works at very high speed, even if it was not originally designed for blocking or multi-threading.
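The coordination role can be sketched as a small driver that accepts any compressor and pre-conditioner as plain callables and applies them block by block in a thread pool. This is a hypothetical illustration, not Blosc's scheduler: the function names, the 32 KiB block size and the default of 4 threads are all assumptions, and zlib is only a stand-in codec.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def identity(block):
    """A pre-conditioner that leaves the block unchanged."""
    return block


def compress_parallel(data, compressor=zlib.compress, preconditioner=identity,
                      block_size=32 * 1024, nthreads=4):
    """Split `data` into blocks, then pre-condition and compress each in a worker."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # map() preserves input order, so output block i matches input block i.
        return list(pool.map(lambda b: compressor(preconditioner(b)), blocks))


def decompress_parallel(blocks, decompressor=zlib.decompress,
                        postconditioner=identity, nthreads=4):
    """Decompress each block in a worker, undo the pre-conditioner, and join."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = pool.map(lambda b: postconditioner(decompressor(b)), blocks)
        return b"".join(parts)


if __name__ == "__main__":
    payload = bytes(range(256)) * 600
    packed = compress_parallel(payload)
    assert decompress_parallel(packed) == payload
    print(len(packed), "blocks compressed")
```

Neither the codec nor the pre-conditioner knows anything about blocks or threads here; the driver supplies both, which is the sense in which the text calls Blosc a meta-compressor.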

Other advantages of Blosc are:

  • Meant for binary data: can take advantage of the type size meta-information for improved compression ratio (using the integrated shuffle pre-conditioner).
  • Small overhead on non-compressible data: at most 16 additional bytes over the source buffer length are needed to compress any input.
  • Maximum destination length: unlike many other compressors, both the compression and decompression routines accept a maximum length for the destination buffer.
  • Replacement for memcpy(): it supports a 0 compression level that does not compress at all and only adds 16 bytes of overhead. In this mode Blosc can copy memory usually faster than a plain memcpy().

When taken together, all these features set Blosc apart from other similar solutions.

Where Can Blosc Be Used?

Blosc was initially developed for the needs of the PyTables database and the blz project, although it may be used elsewhere and, in fact, a new ecosystem is slowly appearing around it (like the excellent Bloscpack format and compressor). It is expected to allow I/O performance to go beyond expected limits and also to perform arithmetic (for example, see http://pytables.org/moin/ComputingKernel), indexing and query operations on extremely large datasets well beyond the speed of more traditional approaches.

Is It Ready For Production Use?

Yup. It is.

The PyTables community has contributed by testing Blosc very hard, and I'm happy to say that, from 0.9.5 on, it has worked flawlessly compressing and decompressing hundreds of terabytes (coming from the hardsuite and extremesuite in the included benchmark) on many different Windows and Unix boxes, both 32-bit and 64-bit. Of course, that does not mean that Blosc doesn't contain bugs, just that grave bugs are unlikely.

Moreover, with the release of Blosc 1.0, the library has been declared stable, and both the API and the format have been frozen, so you should expect a large degree of stability for your Blosc-powered apps.

Python wrapping

You can find a Python package that wraps Blosc at:

 http://github.com/ContinuumIO/python-blosc
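As a quick taste of the wrapper, the sketch below round-trips an array of doubles through blosc.compress() and blosc.decompress(). It assumes the python-blosc package is installed and importable as blosc; roundtrip() is a hypothetical helper written for this example, and the import guard is there only so the snippet still loads without the package.

```python
import array

try:
    import blosc  # provided by the python-blosc package
except ImportError:  # the package is optional for this sketch
    blosc = None


def roundtrip(values):
    """Compress and decompress an array; return the compressed size in bytes."""
    raw = values.tobytes()
    # typesize tells the shuffle pre-conditioner how wide each element is.
    packed = blosc.compress(raw, typesize=values.itemsize)
    assert blosc.decompress(packed) == raw
    return len(packed)


if __name__ == "__main__":
    if blosc is None:
        print("python-blosc is not installed")
    else:
        data = array.array("d", (i * 0.5 for i in range(100_000)))
        print(len(data) * data.itemsize, "->", roundtrip(data), "bytes")
```

Passing the element width as typesize is how the type size meta-information mentioned earlier reaches the integrated shuffle step.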

Command line interface

Bloscpack is a nice command line tool that uses Blosc for compressing existing binary datasets on-disk:

 https://github.com/esc/bloscpack

Although the format is still unstable, Bloscpack is making great strides towards becoming a great utility. Try it out!

Want To Contribute?

Your cooperation is very important to make Blosc as solid as possible, so if you find a bug or want to propose an enhancement, feel free to open a new ticket. You can also contribute to this project simply by compiling and running the different benchmark and test suites on your platform, as explained in the SyntheticBenchmarks page.

Blosc License

Blosc is free software released under the terms of the very permissive MIT license, so you can use it in almost any way you want!

About This Site

This is a place where you can have a look at the Blosc sources, download patches, view existing (open or already closed) tickets and file new tickets. Its goal is to simplify effective tracking and handling of Blosc issues, enhancements and overall progress.

Important note: To prevent spam, you must meet the following requirement before modifying anything on this site:

  • You need to enable cookies to access some parts of this site; otherwise, you may trigger spam protection and get a "Payment Required" page. Sorry for the inconvenience.

Like all wiki pages, this page is editable: you can modify its contents using just your web browser. Simply click on the "Edit this page" link at the bottom of the page. WikiFormatting gives a detailed description of the available wiki formatting commands.

Enjoy data!
Francesc Alted
