snappy streaming codec summary

Posted by Innovative_Andy on Sun, 31 Oct 2021 11:03:42 +0100


Snappy is Google's open-source library for fast data compression and decompression. Its goal is not the maximum compression ratio, but very high compression speed together with a reasonable compression ratio. Snappy is widely used both inside Google and in open-source projects such as Hadoop, LevelDB, and Spark. Official library address:

Recent work led me to discover that snappy also has two streaming variants, which implement streaming encoding and decoding of the snappy algorithm for local file systems and for Hadoop, respectively. Below they are called the stream snappy codec and the hadoop stream snappy codec, to distinguish them from the raw snappy codec:

  • raw snappy codec: encodes and decodes a whole file or a whole block of data at once

  • stream snappy codec: a streaming codec that adds the concept of a chunk on top of the raw snappy codec; the minimum granularity of encoding and decoding is a chunk

  • hadoop stream snappy codec: a streaming codec that adds the concept of a block on top of the raw snappy codec; the minimum granularity of encoding and decoding is a block

So why do we need a snappy streaming codec?

Without a streaming codec, decompressing snappy data requires reading all of it into memory first; for a very large file, memory may not suffice. A streaming codec instead reads one frame at a time from the file stream and decompresses it, repeating read-then-decompress so that memory usage stays bounded.

Google's official snappy library does not implement the streaming codecs above, but python-snappy does:

Here is an example:

#!/usr/bin/env python
# coding: utf-8
import snappy

text_file = "1.txt"
snappy_file = "1.snappy"
with open(text_file, "rb") as input_file:
    uncompressed_data = input_file.read()
    # raw snappy codec
    compressed_data = snappy.compress(uncompressed_data)
    assert uncompressed_data == snappy.uncompress(compressed_data)
    # stream snappy codec
    c = snappy.StreamCompressor()
    d = snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)
    # hadoop stream snappy codec
    c = snappy.hadoop_snappy.StreamCompressor()
    d = snappy.hadoop_snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)

raw snappy codec

The raw codec's algorithm is involved; for details, please refer to:

stream snappy codec

A file compressed with the stream snappy codec consists of a sequence of chunks: chunk | chunk | chunk | ... | chunk

Each chunk consists of: header | body

Chunk header layout: chunk_type(1B) | chunk_size(3B), where chunk_size is the byte length of the chunk's body.

The possible values of chunk_type are:

  • stream identifier (chunk_type = 0xff). The first chunk of a compressed file must be of this type. chunk_size = 6, body = "sNaPpY"

  • compressed data (chunk_type = 0x00). Carries actual data. Body layout: crc32 checksum (4B) | compressed_data, where compressed_data is produced by the raw snappy codec and the crc32 checksum is the masked CRC-32C of the uncompressed data. Note: size(compressed_data) <= 2^24 - 1, size(uncompressed_data) <= 65,536

  • uncompressed data (chunk_type = 0x01). Carries actual data. Body layout: crc32 checksum (4B) | uncompressed_data, where the crc32 checksum is the masked CRC-32C of uncompressed_data. Note: size(uncompressed_data) <= 65,536

  • padding (chunk_type = 0xfe). Mainly used for zero-fill alignment. body = 0000...

  • reserved unskippable chunks (chunk_type in 0x02-0x7f). Chunk types reserved for future extensions. A decoder encountering one of these must report an error immediately

  • reserved skippable chunks (chunk_type in 0x80-0xfd). Chunk types reserved for future extensions. A decoder encountering one of these should skip it and continue
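To make the header layout concrete, here is a stdlib-only sketch (not part of python-snappy; the function name is my own) that walks a framed stream and splits it into (chunk_type, body) pairs. The example stream is hand-built from an identifier chunk and a 4-byte padding chunk:

```python
import struct

def read_chunk_headers(buf):
    """Split a framed stream into (chunk_type, body) pairs.

    The 4-byte chunk header packs chunk_type into the low byte and
    chunk_size (the body length) into the upper 3 bytes, little-endian.
    """
    chunks = []
    pos = 0
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type = header & 0xFF
        size = header >> 8
        chunks.append((chunk_type, buf[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return chunks

# Identifier chunk (type 0xff, size 6) followed by a padding chunk
# (type 0xfe, size 4, body all zeros).
stream = b"\xff\x06\x00\x00sNaPpY" + b"\xfe\x04\x00\x00\x00\x00\x00\x00"
chunks = read_chunk_headers(stream)
```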

The masked crc32 checksum is computed as follows (where _crc32c computes a standard CRC-32C):

def _masked_crc32c(data):
    # see the framing format specification
    crc = _crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff
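In the real library, `_crc32c` comes from a compiled extension. Assuming it is a standard CRC-32C (Castagnoli polynomial), a pure-Python sketch of both the checksum and the masking could look like this:

```python
# Pure-Python CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
# standing in for the library's `_crc32c` helper.
def _build_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC_TABLE = _build_table()

def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc32c(data):
    # Rotate the crc right by 15 bits, then add a constant, as in the
    # framing format; masking avoids problems when the checksummed data
    # itself contains embedded crcs.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF
```

The check value for CRC-32C is crc32c(b"123456789") == 0xE3069283, which is a quick way to validate an implementation.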

This framing guarantees that two stream-snappy-encoded files concatenated with the linux cat command still form a legal compressed file.

For more details, see:

The decompression implementation in the python-snappy library:

def decompress(self, data):
    """Decompress 'data', returning a string containing the uncompressed
    data corresponding to at least part of the data in string. This data
    should be concatenated to the output produced by any preceding calls to
    the decompress() method. Some of the input data may be preserved in
    internal buffers for later processing.
    """
    self._buf += data
    uncompressed = bytearray()
    while True:
        if len(self._buf) < 4:
            return bytes(uncompressed)
        chunk_type = struct.unpack("<L", self._buf[:4])[0]
        size = (chunk_type >> 8)
        chunk_type &= 0xff
        if not self._header_found:
            if (chunk_type != _IDENTIFIER_CHUNK or
                    size != len(_STREAM_IDENTIFIER)):
                raise UncompressError("stream missing snappy identifier")
            self._header_found = True
        if (_RESERVED_UNSKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_UNSKIPPABLE[1]):
            raise UncompressError(
                "stream received unskippable but unknown chunk")
        if len(self._buf) < 4 + size:
            return bytes(uncompressed)
        chunk, self._buf = self._buf[4:4 + size], self._buf[4 + size:]
        if chunk_type == _IDENTIFIER_CHUNK:
            if chunk != _STREAM_IDENTIFIER:
                raise UncompressError(
                    "stream has invalid snappy identifier")
            continue
        if (_RESERVED_SKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_SKIPPABLE[1]):
            continue
        assert chunk_type in (_COMPRESSED_CHUNK, _UNCOMPRESSED_CHUNK)
        crc, chunk = chunk[:4], chunk[4:]
        if chunk_type == _COMPRESSED_CHUNK:
            chunk = _uncompress(chunk)
        if struct.pack("<L", _masked_crc32c(chunk)) != crc:
            raise UncompressError("crc mismatch")
        uncompressed += chunk

The execution process is as follows:

  • First, check that the type of the first chunk is stream identifier; if not, the format is illegal and an error is raised. If it is, verify that the chunk body is "sNaPpY"

  • If the chunk type is reserved unskippable, the format is illegal and an error is raised

  • If the chunk type is reserved skippable, skip the current chunk

  • Verify the crc32 checksum of the chunk body

  • If the chunk type is uncompressed data, append uncompressed_data to the decompression result

  • If the chunk type is compressed data, decompress compressed_data with the raw snappy codec and append the resulting uncompressed_data to the decompression result

  • After the current chunk is processed, move on to the next one, until the end of the file
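The steps above can be sketched as a stdlib-only toy decoder (the function name is my own, not python-snappy's). Assumptions: it handles only identifier, skippable, and uncompressed chunks, and it omits crc verification and compressed chunks, which would need a CRC-32C implementation and the raw snappy codec:

```python
import struct

def decode_frames(buf):
    """Toy framed-stream decoder: returns the concatenated payloads of
    all uncompressed chunks. crc bytes are skipped, not verified."""
    out = bytearray()
    pos = 0
    header_found = False
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type, size = header & 0xFF, header >> 8
        body = buf[pos + 4:pos + 4 + size]
        pos += 4 + size
        if not header_found:
            # The first chunk must be the stream identifier.
            if chunk_type != 0xFF or body != b"sNaPpY":
                raise ValueError("stream missing snappy identifier")
            header_found = True
            continue
        if 0x02 <= chunk_type <= 0x7F:
            raise ValueError("unskippable reserved chunk")
        if 0x80 <= chunk_type <= 0xFE:
            continue  # skippable: padding and reserved chunks
        if chunk_type == 0x01:      # uncompressed chunk
            out += body[4:]         # drop the 4-byte masked crc
    return bytes(out)

# Hand-built stream: identifier chunk, then an uncompressed chunk whose
# body is a placeholder 4-byte crc followed by b"hello".
stream = (b"\xff\x06\x00\x00sNaPpY"
          + b"\x01\x09\x00\x00" + b"\x00\x00\x00\x00hello")
```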

hadoop stream snappy codec

The hadoop stream snappy codec is mainly used for snappy encoding and decoding of HDFS files. Its format is simpler than the stream snappy codec's.

A file consists of multiple blocks: block | ... | block

Format of each block: total_len(4B) | compressed_len(4B) | compressed_data(compressed_len bytes, variable length) | compressed_len | compressed_data | ...; that is, one total_len followed by one or more (compressed_len, compressed_data) subblocks.


  • total_len: the total length of the decompressed data of the current block

  • compressed_len: the length of the following compressed data

  • compressed_data: the compressed data itself
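A stdlib-only sketch of this framing (the function names are my own; I assume hadoop's usual big-endian 4-byte integers, and the payloads below are placeholder bytes standing in for real raw-snappy-compressed data):

```python
import struct

def pack_block(total_len, payloads):
    """Frame one hadoop-style block: total_len, then one
    (compressed_len, compressed_data) pair per payload."""
    out = struct.pack(">I", total_len)
    for p in payloads:
        out += struct.pack(">I", len(p)) + p
    return out

def unpack_block(buf):
    """Inverse of pack_block: recover total_len and the payload list."""
    total_len = struct.unpack(">I", buf[:4])[0]
    payloads, pos = [], 4
    while pos < len(buf):
        n = struct.unpack(">I", buf[pos:pos + 4])[0]
        payloads.append(buf[pos + 4:pos + 4 + n])
        pos += 4 + n
    return total_len, payloads

# Placeholder payloads; in a real file these would be snappy-compressed
# and total_len would count the uncompressed bytes of the whole block.
block = pack_block(11, [b"fake-snappy", b"data"])
```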

The decompression algorithm in snappy python library is as follows:

    def decompress(self, data):
        """Decompress 'data', returning a string containing the uncompressed
        data corresponding to at least part of the data in string. This data
        should be concatenated to the output produced by any preceding calls to
        the decompress() method. Some of the input data may be preserved in
        internal buffers for later processing.
        """
        int_size = _INT_SIZE
        self._buf += data
        uncompressed = []
        while True:
            if len(self._buf) < int_size:
                return b"".join(uncompressed)
            next_start = 0
            if not self._block_length:
                self._block_length = unpack_int(self._buf[:int_size])
                self._buf = self._buf[int_size:]
                if len(self._buf) < int_size:
                    return b"".join(uncompressed)
            compressed_length = unpack_int(
                self._buf[next_start:next_start + int_size]
            )
            next_start += int_size
            if len(self._buf) < compressed_length + next_start:
                return b"".join(uncompressed)
            chunk = self._buf[
                next_start:next_start + compressed_length
            ]
            self._buf = self._buf[next_start + compressed_length:]
            uncompressed_chunk = _uncompress(chunk)
            self._uncompressed_length += len(uncompressed_chunk)
            uncompressed.append(uncompressed_chunk)
            if self._uncompressed_length == self._block_length:
                # Here we have uncompressed all subblocks of the current block
                self._uncompressed_length = 0
                self._block_length = 0

The code logic is relatively simple, and the process is as follows:

  • Read block_length and initialize uncompressed_length, which tracks how much data has been decompressed from the current block so far

  • Read compressed_length and compressed_data, decompress the data with the raw snappy codec to get uncompressed_data, append it to the output, and add its length to uncompressed_length

  • When uncompressed_length reaches block_length, the current block is fully processed; reset the internal state and move on to the next block. This repeats until the end of the file

References

c++ snappy lib:

python snappy lib: