===============================================================
 Blosc: A blocking, shuffling and lossless compression library
===============================================================

:Author: Francesc Alted i Abad
:Contact: faltet@pytables.org
:URL: http://blosc.pytables.org


What is it?
===========

Blosc [1]_ is a high-performance compressor optimized for binary data.
It has been designed to transmit data to the processor cache faster
than the traditional, non-compressed, direct memory fetch approach via
a memcpy() call. Blosc is the first compressor (that I'm aware of)
that is meant not only to reduce the size of large datasets on-disk or
in-memory, but also to accelerate memory-bound computations.

It uses the blocking technique (as described in [2]_) to reduce
activity on the memory bus as much as possible. In short, this
technique works by dividing datasets into blocks that are small enough
to fit in the caches of modern processors and performing compression /
decompression there. It also leverages, if available, SIMD
instructions (SSE2) and the multi-threading capabilities of CPUs in
order to accelerate the compression / decompression process as much as
possible.

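As a rough illustration of the blocking idea described above, the
following C sketch walks a buffer in cache-sized pieces and hands each
piece to a callback that stands in for the compression or decompression
step. The names for_each_block(), sum_block() and SKETCH_BLOCK_SIZE, as
well as the 32 KB block size, are assumptions made for this example;
they are not part of Blosc, which chooses its own block sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block size, picked to fit in a typical L1 data cache.
 * Illustrative only: Blosc selects its own block sizes. */
#define SKETCH_BLOCK_SIZE ((size_t)(32 * 1024))

/* Callback applied to one block; stands in for "compress" or "decompress". */
typedef void (*block_fn)(const uint8_t *block, size_t len, void *state);

/* Visit `src` block by block so that each piece can stay cache-resident
 * while it is being processed. Returns the number of blocks visited. */
size_t for_each_block(const uint8_t *src, size_t nbytes,
                      block_fn fn, void *state)
{
    size_t nblocks = 0;

    assert(fn != NULL);
    while (nbytes > 0) {
        /* The last block may be shorter than SKETCH_BLOCK_SIZE. */
        size_t len = nbytes < SKETCH_BLOCK_SIZE ? nbytes : SKETCH_BLOCK_SIZE;
        fn(src, len, state);
        src += len;
        nbytes -= len;
        nblocks++;
    }
    return nblocks;
}

/* Example callback: adds every byte of the block to the uint64_t in *state. */
void sum_block(const uint8_t *block, size_t len, void *state)
{
    uint64_t *total = state;

    for (size_t i = 0; i < len; i++)
        *total += block[i];
}
```

With these definitions, an 80 KB buffer is visited as two full 32 KB
blocks followed by one 16 KB block.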

You can see some recent benchmarks of Blosc performance in [3]_.

Blosc is distributed under the MIT license; see LICENSES/BLOSC.txt for
details.

.. [1] http://blosc.pytables.org
.. [2] http://www.pytables.org/docs/CISE-12-2-ScientificPro.pdf
.. [3] http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks


Meta-compression and other advantages over other existing compressors
=====================================================================

Blosc is not like other compressors: it is better described as a
meta-compressor, because it can use different compressors and
pre-conditioners (programs that generally improve the compression
ratio). It can still be called a compressor, though, because it
already integrates one compressor and one pre-conditioner and so works
as one out of the box.

Currently it uses BloscLZ, a compressor heavily based on FastLZ
(http://fastlz.org/), and a highly optimized Shuffle pre-conditioner
(it can use SSE2 instructions, if available). However, different
compressors or pre-conditioners may be added in the future.

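To give an idea of what a shuffle pre-conditioner does, here is a
minimal, non-SIMD sketch of a byte shuffle and its inverse. It groups
the first byte of every element together, then the second byte, and so
on, which tends to produce long runs of similar bytes for the
compressor. The function names shuffle_sketch() and unshuffle_sketch()
are made up for this example, and the code is not the optimized
routine shipped in shuffle.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte shuffle: gather byte 0 of every element, then byte 1, and so on.
 * `src` holds `nelems` elements of `typesize` bytes each; `dest` must have
 * room for nelems * typesize bytes and must not overlap `src`. */
void shuffle_sketch(size_t typesize, size_t nelems,
                    const uint8_t *src, uint8_t *dest)
{
    assert(typesize > 0);
    for (size_t j = 0; j < typesize; j++)
        for (size_t i = 0; i < nelems; i++)
            dest[j * nelems + i] = src[i * typesize + j];
}

/* Inverse transform: restores the original element-by-element layout. */
void unshuffle_sketch(size_t typesize, size_t nelems,
                      const uint8_t *src, uint8_t *dest)
{
    assert(typesize > 0);
    for (size_t i = 0; i < nelems; i++)
        for (size_t j = 0; j < typesize; j++)
            dest[i * typesize + j] = src[j * nelems + i];
}
```

For example, three 4-byte elements stored as the bytes
01 00 00 00 02 00 00 00 03 00 00 00 are shuffled into 01 02 03 followed
by nine zero bytes, and unshuffling gives back the original sequence.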

Blosc is in charge of coordinating the compressor and pre-conditioners
so that they can automatically leverage the blocking technique
(described above) as well as multi-threaded execution (if several
cores are available). As a result, every compressor and
pre-conditioner works at very high speed, even if it was not
initially designed for blocking or multi-threading.

Other advantages of Blosc are:

* Meant for binary data: it can take advantage of the type size
  meta-information for an improved compression ratio (using the
  integrated shuffle pre-conditioner).

* Small overhead on non-compressible data: at most 16 additional
  bytes over the source buffer length are needed to compress *every*
  input.

* Maximum destination length: contrary to many other compressors,
  both the compression and decompression routines support a maximum
  length for the destination buffer.

When taken together, all these features set Blosc apart from other
similar solutions.

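The 16-byte overhead bound and the maximum destination length listed
above can be pictured with the following sketch. It computes a
worst-case destination size and implements a toy "store uncompressed"
fallback that refuses to write past the destination limit. Everything
here is hypothetical: the names worst_case_dest_size(),
store_raw_sketch() and SKETCH_MAX_OVERHEAD are invented for the
example, and the header layout is not Blosc's real one. See blosc.h
for the actual routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The 16-byte worst-case overhead stated in the text. The macro name is
 * made up for this sketch. */
#define SKETCH_MAX_OVERHEAD ((size_t)16)

/* Worst-case destination size for `nbytes` of input, or 0 on overflow. */
size_t worst_case_dest_size(size_t nbytes)
{
    if (nbytes > SIZE_MAX - SKETCH_MAX_OVERHEAD)
        return 0;
    return nbytes + SKETCH_MAX_OVERHEAD;
}

/* Toy "store uncompressed" fallback that honours a destination limit.
 * Writes a 16-byte header (hypothetical layout, NOT Blosc's real header)
 * followed by a verbatim copy of the input. Returns the number of bytes
 * written, or 0 if `destsize` is too small; `dest` is never overrun. */
size_t store_raw_sketch(const void *src, size_t nbytes,
                        void *dest, size_t destsize)
{
    size_t needed = worst_case_dest_size(nbytes);
    uint64_t len64 = (uint64_t)nbytes;
    uint8_t *out = dest;

    assert(src != NULL && dest != NULL);
    if (needed == 0 || needed > destsize)
        return 0;
    memset(out, 0, SKETCH_MAX_OVERHEAD);
    memcpy(out, &len64, sizeof len64);   /* first 8 header bytes: length */
    memcpy(out + SKETCH_MAX_OVERHEAD, src, nbytes);
    return needed;
}
```

Under these assumptions, a 100-byte input never needs more than 116
bytes of destination space, and a 115-byte destination is rejected
without being written to.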

Compiling your application with Blosc
=====================================

Blosc consists of the following files (in the blosc/ directory)::

    blosc.h and blosc.c      -- the main routines
    blosclz.h and blosclz.c  -- the actual compressor
    shuffle.h and shuffle.c  -- the shuffle code

Just add these files to your project in order to use Blosc. For
information on the compression and decompression routines, see blosc.h.

To compile using GCC/MINGW (4.4 or higher recommended)::

    gcc -O3 -msse2 -o myprog myprog.c blosc/*.c -lpthread

Using Windows and MSVC (2008 or higher recommended)::

    cl /Ox /Femyprog.exe myprog.c blosc\*.c /link pthreadvc2.lib

[remember to set the LIB and INCLUDE environment variables to the
pthread-win32 directories first]

A simple usage example is the benchmark in the bench/bench.c file.
Another example, using Blosc as a generic HDF5 filter, is in the
hdf5/ directory.

I have not yet tried to compile this with compilers other than GCC,
MINGW, Intel ICC or MSVC. Please report your experiences with your own
platforms.


Testing Blosc
=============

Go to the test/ directory and issue::

    $ make test

These tests are very basic, and only valid for platforms where GNU
make/gcc tools are available. If you really want to test Blosc the
hard way, look at:

http://blosc.pytables.org/trac/wiki/SyntheticBenchmarks

where instructions on how to intensively test (and benchmark) Blosc
are given. If you get an error while running these tests, please
report it back!


Filter for HDF5
===============

For those who want to use Blosc as a filter in the HDF5 library,
there is an implementation in the hdf5/ directory.


Acknowledgments
===============

I'd like to thank the PyTables community, which has collaborated in the
exhaustive testing of Blosc. With an aggregate of more than 300 TB of
different datasets compressed *and* decompressed successfully, I can
say that Blosc is pretty safe now and ready for production purposes.


----

**Enjoy data!**