snappy streaming codec summary

Posted by Innovative_Andy on Sun, 31 Oct 2021 11:03:42 +0100


Snappy is Google's open-source library for fast data compression and decompression. Its goal is not the maximum compression ratio, but very high compression speed together with a reasonable compression ratio. Snappy is widely used both inside Google and in open-source projects such as Hadoop, LevelDB, and Spark. Official library address:

Recent work led me to discover that snappy also has two streaming variants, which implement streaming encoding and decoding of the snappy algorithm for local file systems and for Hadoop, respectively. Below they are called the stream snappy codec and the hadoop stream snappy codec, to distinguish them from the raw snappy codec:

  • raw snappy codec: encodes and decodes a whole file or a whole block of data at once

  • stream snappy codec: a streaming codec that adds the concept of a chunk on top of the raw snappy codec; the minimum granularity of encoding and decoding is a chunk

  • hadoop stream snappy codec: a streaming codec that adds the concept of a block on top of the raw snappy codec; the minimum granularity of encoding and decoding is a block

So why do we need a snappy streaming codec?

Without a streaming codec, decompressing snappy data requires reading all of it into memory first; for a very large file, memory may not suffice. A streaming codec instead reads one frame at a time from the file stream and decompresses it, repeating read-then-decompress so that memory usage stays bounded.

Google's official snappy library does not implement the streaming codecs above, but python-snappy does:

Here is an example:

#!/usr/bin/env python
# coding: utf-8
import snappy

text_file = "1.txt"
snappy_file = "1.snappy"
with open(text_file, "rb") as input_file:
    uncompressed_data = input_file.read()
    # raw snappy codec
    compressed_data = snappy.compress(uncompressed_data)
    assert uncompressed_data == snappy.uncompress(compressed_data)
    # stream snappy codec
    c = snappy.StreamCompressor()
    d = snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)
    # hadoop stream snappy codec
    c = snappy.hadoop_snappy.StreamCompressor()
    d = snappy.hadoop_snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)

raw snappy codec

The raw codec's algorithm is involved; for details, please refer to:

stream snappy codec

A file compressed with the stream snappy codec consists of a sequence of chunks: chunk | chunk | chunk | ... | chunk

Each chunk consists of: header | body

Chunk header layout: chunk_type(1B) | chunk_size(3B), where chunk_size is the byte length of the chunk's body.

The possible values of chunk_type are:

  • stream identifier (chunk_type = 0xff). The first chunk of a compressed file must be of this type. chunk_size = 6, body = "sNaPpY"

  • compressed data (chunk_type = 0x00). Carries actual data. Body layout: crc32 checksum (4B) | compressed_data, where compressed_data is produced by the raw snappy codec and the crc32 checksum is the masked CRC-32C of the uncompressed data. Note: size(compressed_data) <= 2^24 - 1, size(uncompressed_data) <= 65,536

  • uncompressed data (chunk_type = 0x01). Carries actual data. Body layout: crc32 checksum (4B) | uncompressed_data, where the crc32 checksum is the masked CRC-32C of uncompressed_data. Note: size(uncompressed_data) <= 65,536

  • padding (chunk_type = 0xfe). Mainly used for zero-fill alignment. body = 0000...

  • reserved unskippable chunks (chunk_type in 0x02-0x7f). Chunk types reserved for future extensions. A decoder encountering one of these must report an error immediately

  • reserved skippable chunks (chunk_type in 0x80-0xfd). Chunk types reserved for future extensions. A decoder encountering one of these should skip it and continue
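To make the header layout concrete, here is a stdlib-only sketch (not part of python-snappy; the function name is my own) that walks a framed stream and splits it into (chunk_type, body) pairs. The example stream is hand-built from an identifier chunk and a 4-byte padding chunk:

```python
import struct

def read_chunk_headers(buf):
    """Split a framed stream into (chunk_type, body) pairs.

    The 4-byte chunk header packs chunk_type into the low byte and
    chunk_size (the body length) into the upper 3 bytes, little-endian.
    """
    chunks = []
    pos = 0
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type = header & 0xFF
        size = header >> 8
        chunks.append((chunk_type, buf[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return chunks

# Identifier chunk (type 0xff, size 6) followed by a padding chunk
# (type 0xfe, size 4, body all zeros).
stream = b"\xff\x06\x00\x00sNaPpY" + b"\xfe\x04\x00\x00\x00\x00\x00\x00"
chunks = read_chunk_headers(stream)
```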

The masked crc32 checksum is computed as follows (where _crc32c computes a standard CRC-32C):

def _masked_crc32c(data):
    # see the framing format specification
    crc = _crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff
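In the real library, `_crc32c` comes from a compiled extension. Assuming it is a standard CRC-32C (Castagnoli polynomial), a pure-Python sketch of both the checksum and the masking could look like this:

```python
# Pure-Python CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
# standing in for the library's `_crc32c` helper.
def _build_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC_TABLE = _build_table()

def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc32c(data):
    # Rotate the crc right by 15 bits, then add a constant, as in the
    # framing format; masking avoids problems when the checksummed data
    # itself contains embedded crcs.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF
```

The check value for CRC-32C is crc32c(b"123456789") == 0xE3069283, which is a quick way to validate an implementation.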

This framing guarantees that two stream-snappy-encoded files concatenated with the linux cat command still form a legal compressed file.

For more details, see:

The decompression implementation in the python-snappy library:

def decompress(self, data):
    """Decompress 'data', returning a string containing the uncompressed
    data corresponding to at least part of the data in string. This data
    should be concatenated to the output produced by any preceding calls to
    the decompress() method. Some of the input data may be preserved in
    internal buffers for later processing.
    """
    self._buf += data
    uncompressed = bytearray()
    while True:
        if len(self._buf) < 4:
            return bytes(uncompressed)
        chunk_type = struct.unpack("<L", self._buf[:4])[0]
        size = (chunk_type >> 8)
        chunk_type &= 0xff
        if not self._header_found:
            if (chunk_type != _IDENTIFIER_CHUNK or
                    size != len(_STREAM_IDENTIFIER)):
                raise UncompressError("stream missing snappy identifier")
            self._header_found = True
        if (_RESERVED_UNSKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_UNSKIPPABLE[1]):
            raise UncompressError(
                "stream received unskippable but unknown chunk")
        if len(self._buf) < 4 + size:
            return bytes(uncompressed)
        chunk, self._buf = self._buf[4:4 + size], self._buf[4 + size:]
        if chunk_type == _IDENTIFIER_CHUNK:
            if chunk != _STREAM_IDENTIFIER:
                raise UncompressError(
                    "stream has invalid snappy identifier")
            continue
        if (_RESERVED_SKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_SKIPPABLE[1]):
            continue
        assert chunk_type in (_COMPRESSED_CHUNK, _UNCOMPRESSED_CHUNK)
        crc, chunk = chunk[:4], chunk[4:]
        if chunk_type == _COMPRESSED_CHUNK:
            chunk = _uncompress(chunk)
        if struct.pack("<L", _masked_crc32c(chunk)) != crc:
            raise UncompressError("crc mismatch")
        uncompressed += chunk

The execution process is as follows:

  • First, check that the type of the first chunk is stream identifier; if not, the format is illegal and an error is raised. If it is, verify that the chunk body is "sNaPpY"

  • If the chunk type is reserved unskippable, the format is illegal and an error is raised

  • If the chunk type is reserved skippable, skip the current chunk

  • Verify the crc32 checksum of the chunk body

  • If the chunk type is uncompressed data, append uncompressed_data to the decompression result

  • If the chunk type is compressed data, decompress compressed_data with the raw snappy codec and append the resulting uncompressed_data to the decompression result

  • After the current chunk is processed, move on to the next one, until the end of the file
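The steps above can be sketched as a stdlib-only toy decoder (the function name is my own, not python-snappy's). Assumptions: it handles only identifier, skippable, and uncompressed chunks, and it omits crc verification and compressed chunks, which would need a CRC-32C implementation and the raw snappy codec:

```python
import struct

def decode_frames(buf):
    """Toy framed-stream decoder: returns the concatenated payloads of
    all uncompressed chunks. crc bytes are skipped, not verified."""
    out = bytearray()
    pos = 0
    header_found = False
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type, size = header & 0xFF, header >> 8
        body = buf[pos + 4:pos + 4 + size]
        pos += 4 + size
        if not header_found:
            # The first chunk must be the stream identifier.
            if chunk_type != 0xFF or body != b"sNaPpY":
                raise ValueError("stream missing snappy identifier")
            header_found = True
            continue
        if 0x02 <= chunk_type <= 0x7F:
            raise ValueError("unskippable reserved chunk")
        if 0x80 <= chunk_type <= 0xFE:
            continue  # skippable: padding and reserved chunks
        if chunk_type == 0x01:      # uncompressed chunk
            out += body[4:]         # drop the 4-byte masked crc
    return bytes(out)

# Hand-built stream: identifier chunk, then an uncompressed chunk whose
# body is a placeholder 4-byte crc followed by b"hello".
stream = (b"\xff\x06\x00\x00sNaPpY"
          + b"\x01\x09\x00\x00" + b"\x00\x00\x00\x00hello")
```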

hadoop stream snappy codec

The hadoop stream snappy codec is mainly used for snappy encoding and decoding of HDFS files. Its format is simpler than the stream snappy codec's.

A file consists of multiple blocks: block | ... | block

Format of each block: total_len(4B) | compressed_len(4B) | compressed_data(compressed_len bytes, variable length) | compressed_len | compressed_data | ...; that is, one total_len followed by one or more (compressed_len, compressed_data) subblocks.


  • total_len: the total length of the decompressed data of the current block

  • compressed_len: the length of the following compressed data

  • compressed_data: the compressed data itself
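A stdlib-only sketch of this framing (the function names are my own; I assume hadoop's usual big-endian 4-byte integers, and the payloads below are placeholder bytes standing in for real raw-snappy-compressed data):

```python
import struct

def pack_block(total_len, payloads):
    """Frame one hadoop-style block: total_len, then one
    (compressed_len, compressed_data) pair per payload."""
    out = struct.pack(">I", total_len)
    for p in payloads:
        out += struct.pack(">I", len(p)) + p
    return out

def unpack_block(buf):
    """Inverse of pack_block: recover total_len and the payload list."""
    total_len = struct.unpack(">I", buf[:4])[0]
    payloads, pos = [], 4
    while pos < len(buf):
        n = struct.unpack(">I", buf[pos:pos + 4])[0]
        payloads.append(buf[pos + 4:pos + 4 + n])
        pos += 4 + n
    return total_len, payloads

# Placeholder payloads; in a real file these would be snappy-compressed
# and total_len would count the uncompressed bytes of the whole block.
block = pack_block(11, [b"fake-snappy", b"data"])
```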

The decompression algorithm in snappy python library is as follows:

    def decompress(self, data):
        """Decompress 'data', returning a string containing the uncompressed
        data corresponding to at least part of the data in string. This data
        should be concatenated to the output produced by any preceding calls to
        the decompress() method. Some of the input data may be preserved in
        internal buffers for later processing.
        """
        int_size = _INT_SIZE
        self._buf += data
        uncompressed = []
        while True:
            if len(self._buf) < int_size:
                return b"".join(uncompressed)
            next_start = 0
            if not self._block_length:
                self._block_length = unpack_int(self._buf[:int_size])
                self._buf = self._buf[int_size:]
                if len(self._buf) < int_size:
                    return b"".join(uncompressed)
            compressed_length = unpack_int(
                self._buf[next_start:next_start + int_size]
            )
            next_start += int_size
            if len(self._buf) < compressed_length + next_start:
                return b"".join(uncompressed)
            chunk = self._buf[
                next_start:next_start + compressed_length
            ]
            self._buf = self._buf[next_start + compressed_length:]
            uncompressed_chunk = _uncompress(chunk)
            self._uncompressed_length += len(uncompressed_chunk)
            uncompressed.append(uncompressed_chunk)
            if self._uncompressed_length == self._block_length:
                # Here we have uncompressed all subblocks of the current block
                self._uncompressed_length = 0
                self._block_length = 0

The code logic is relatively simple, and the process is as follows:

  • Read block_length and initialize uncompressed_length, which tracks how much data has been decompressed from the current block so far

  • Read compressed_length and compressed_data, decompress the data with the raw snappy codec to get uncompressed_data, append it to the output, and add its length to uncompressed_length

  • When uncompressed_length reaches block_length, the current block is fully processed; reset the internal state and move on to the next block. This repeats until the end of the file

References

c++ snappy lib:

python snappy lib: