snappy streaming codec summary

Posted by Innovative_Andy on Sun, 31 Oct 2021 11:03:42 +0100


Snappy is Google's open-source library for fast data compression and decompression. Its goal is not the maximum compression ratio, but very high compression speed together with a reasonable compression ratio. Snappy is widely used both inside Google and in open-source projects such as Hadoop, LevelDB, and Spark. Official library address:

Recent work led me to discover that snappy also has two streaming variants, which implement streaming encoding and decoding of the snappy algorithm for local file systems and for Hadoop, respectively. Below they are called the stream snappy codec and the hadoop stream snappy codec, to distinguish them from the raw snappy codec:

  • raw snappy codec: encodes and decodes a whole file or a whole block of data at once

  • stream snappy codec: a streaming codec that adds the concept of a chunk on top of the raw snappy codec; the minimum granularity of encoding and decoding is a chunk

  • hadoop stream snappy codec: a streaming codec that adds the concept of a block on top of the raw snappy codec; the minimum granularity of encoding and decoding is a block

So why do we need a snappy streaming codec?

Without a streaming codec, decompressing snappy data requires reading all of it into memory first; for a very large file, memory may not suffice. A streaming codec instead reads one frame at a time from the file stream and decompresses it, repeating read-then-decompress so that memory usage stays bounded.

Google's official snappy library does not implement the streaming codecs above, but python-snappy does:

Here is an example:

#!/usr/bin/env python
# coding: utf-8
import snappy

text_file = "1.txt"
snappy_file = "1.snappy"
with open(text_file, "rb") as input_file:
    uncompressed_data = input_file.read()
    # raw snappy codec
    compressed_data = snappy.compress(uncompressed_data)
    assert uncompressed_data == snappy.uncompress(compressed_data)
    # stream snappy codec
    c = snappy.StreamCompressor()
    d = snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)
    # hadoop stream snappy codec
    c = snappy.hadoop_snappy.StreamCompressor()
    d = snappy.hadoop_snappy.StreamDecompressor()
    compressed_data = c.compress(uncompressed_data)
    assert uncompressed_data == d.decompress(compressed_data)

raw snappy codec

The raw codec's algorithm is involved; for details, please refer to:

stream snappy codec

A file compressed with the stream snappy codec consists of a sequence of chunks: chunk | chunk | chunk | ... | chunk

Each chunk consists of: header | body

Chunk header layout: chunk_type(1B) | chunk_size(3B), where chunk_size is the byte length of the chunk's body.

The possible values of chunk_type are:

  • stream identifier (chunk_type = 0xff). The first chunk of a compressed file must be of this type. chunk_size = 6, body = "sNaPpY"

  • compressed data (chunk_type = 0x00). Carries actual data. Body layout: crc32 checksum (4B) | compressed_data, where compressed_data is produced by the raw snappy codec and the crc32 checksum is the masked CRC-32C of the uncompressed data. Note: size(compressed_data) <= 2^24 - 1, size(uncompressed_data) <= 65,536

  • uncompressed data (chunk_type = 0x01). Carries actual data. Body layout: crc32 checksum (4B) | uncompressed_data, where the crc32 checksum is the masked CRC-32C of uncompressed_data. Note: size(uncompressed_data) <= 65,536

  • padding (chunk_type = 0xfe). Mainly used for zero-fill alignment. body = 0000...

  • reserved unskippable chunks (chunk_type in 0x02-0x7f). Chunk types reserved for future extensions. A decoder encountering one of these must report an error immediately

  • reserved skippable chunks (chunk_type in 0x80-0xfd). Chunk types reserved for future extensions. A decoder encountering one of these should skip it and continue
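To make the header layout concrete, here is a stdlib-only sketch (not part of python-snappy; the function name is my own) that walks a framed stream and splits it into (chunk_type, body) pairs. The example stream is hand-built from an identifier chunk and a 4-byte padding chunk:

```python
import struct

def read_chunk_headers(buf):
    """Split a framed stream into (chunk_type, body) pairs.

    The 4-byte chunk header packs chunk_type into the low byte and
    chunk_size (the body length) into the upper 3 bytes, little-endian.
    """
    chunks = []
    pos = 0
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type = header & 0xFF
        size = header >> 8
        chunks.append((chunk_type, buf[pos + 4:pos + 4 + size]))
        pos += 4 + size
    return chunks

# Identifier chunk (type 0xff, size 6) followed by a padding chunk
# (type 0xfe, size 4, body all zeros).
stream = b"\xff\x06\x00\x00sNaPpY" + b"\xfe\x04\x00\x00\x00\x00\x00\x00"
chunks = read_chunk_headers(stream)
```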

The masked crc32 checksum is computed as follows (where _crc32c computes a standard CRC-32C):

def _masked_crc32c(data):
    # see the framing format specification
    crc = _crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xa282ead8) & 0xffffffff
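In the real library, `_crc32c` comes from a compiled extension. Assuming it is a standard CRC-32C (Castagnoli polynomial), a pure-Python sketch of both the checksum and the masking could look like this:

```python
# Pure-Python CRC-32C (Castagnoli), reflected polynomial 0x82F63B78,
# standing in for the library's `_crc32c` helper.
def _build_table():
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table

_CRC_TABLE = _build_table()

def crc32c(data):
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF

def masked_crc32c(data):
    # Rotate the crc right by 15 bits, then add a constant, as in the
    # framing format; masking avoids problems when the checksummed data
    # itself contains embedded crcs.
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF
```

The check value for CRC-32C is crc32c(b"123456789") == 0xE3069283, which is a quick way to validate an implementation.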

This framing guarantees that two stream-snappy-encoded files concatenated with the linux cat command still form a legal compressed file.

For more details, see:

The decompression implementation in the python-snappy library:

def decompress(self, data):
    """Decompress 'data', returning a string containing the uncompressed
    data corresponding to at least part of the data in string. This data
    should be concatenated to the output produced by any preceding calls to
    the decompress() method. Some of the input data may be preserved in
    internal buffers for later processing.
    """
    self._buf += data
    uncompressed = bytearray()
    while True:
        if len(self._buf) < 4:
            return bytes(uncompressed)
        chunk_type = struct.unpack("<L", self._buf[:4])[0]
        size = (chunk_type >> 8)
        chunk_type &= 0xff
        if not self._header_found:
            if (chunk_type != _IDENTIFIER_CHUNK or
                    size != len(_STREAM_IDENTIFIER)):
                raise UncompressError("stream missing snappy identifier")
            self._header_found = True
        if (_RESERVED_UNSKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_UNSKIPPABLE[1]):
            raise UncompressError(
                "stream received unskippable but unknown chunk")
        if len(self._buf) < 4 + size:
            return bytes(uncompressed)
        chunk, self._buf = self._buf[4:4 + size], self._buf[4 + size:]
        if chunk_type == _IDENTIFIER_CHUNK:
            if chunk != _STREAM_IDENTIFIER:
                raise UncompressError(
                    "stream has invalid snappy identifier")
            continue
        if (_RESERVED_SKIPPABLE[0] <= chunk_type and
                chunk_type < _RESERVED_SKIPPABLE[1]):
            continue
        assert chunk_type in (_COMPRESSED_CHUNK, _UNCOMPRESSED_CHUNK)
        crc, chunk = chunk[:4], chunk[4:]
        if chunk_type == _COMPRESSED_CHUNK:
            chunk = _uncompress(chunk)
        if struct.pack("<L", _masked_crc32c(chunk)) != crc:
            raise UncompressError("crc mismatch")
        uncompressed += chunk

The execution process is as follows:

  • First, check that the type of the first chunk is stream identifier; if not, the format is illegal and an error is raised. If it is, verify that the chunk body is "sNaPpY"

  • If the chunk type is reserved unskippable, the format is illegal and an error is raised

  • If the chunk type is reserved skippable, skip the current chunk

  • Verify the crc32 checksum of the chunk body

  • If the chunk type is uncompressed data, append uncompressed_data to the decompression result

  • If the chunk type is compressed data, decompress compressed_data with the raw snappy codec and append the resulting uncompressed_data to the decompression result

  • After the current chunk is processed, move on to the next one, until the end of the file
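The steps above can be sketched as a stdlib-only toy decoder (the function name is my own, not python-snappy's). Assumptions: it handles only identifier, skippable, and uncompressed chunks, and it omits crc verification and compressed chunks, which would need a CRC-32C implementation and the raw snappy codec:

```python
import struct

def decode_frames(buf):
    """Toy framed-stream decoder: returns the concatenated payloads of
    all uncompressed chunks. crc bytes are skipped, not verified."""
    out = bytearray()
    pos = 0
    header_found = False
    while pos + 4 <= len(buf):
        header = struct.unpack("<L", buf[pos:pos + 4])[0]
        chunk_type, size = header & 0xFF, header >> 8
        body = buf[pos + 4:pos + 4 + size]
        pos += 4 + size
        if not header_found:
            # The first chunk must be the stream identifier.
            if chunk_type != 0xFF or body != b"sNaPpY":
                raise ValueError("stream missing snappy identifier")
            header_found = True
            continue
        if 0x02 <= chunk_type <= 0x7F:
            raise ValueError("unskippable reserved chunk")
        if 0x80 <= chunk_type <= 0xFE:
            continue  # skippable: padding and reserved chunks
        if chunk_type == 0x01:      # uncompressed chunk
            out += body[4:]         # drop the 4-byte masked crc
    return bytes(out)

# Hand-built stream: identifier chunk, then an uncompressed chunk whose
# body is a placeholder 4-byte crc followed by b"hello".
stream = (b"\xff\x06\x00\x00sNaPpY"
          + b"\x01\x09\x00\x00" + b"\x00\x00\x00\x00hello")
```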

hadoop stream snappy codec

The hadoop stream snappy codec is mainly used for snappy encoding and decoding of HDFS files. Its format is simpler than the stream snappy codec's.

A file consists of multiple blocks: block | ... | block

Format of each block: total_len(4B) | compressed_len(4B) | compressed_data(compressed_len bytes, variable length) | compressed_len | compressed_data | ...; that is, one total_len followed by one or more (compressed_len, compressed_data) subblocks.


  • total_len: the total length of the decompressed data of the current block

  • compressed_len: the length of the following compressed data

  • compressed_data: the compressed data itself
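A stdlib-only sketch of this framing (the function names are my own; I assume hadoop's usual big-endian 4-byte integers, and the payloads below are placeholder bytes standing in for real raw-snappy-compressed data):

```python
import struct

def pack_block(total_len, payloads):
    """Frame one hadoop-style block: total_len, then one
    (compressed_len, compressed_data) pair per payload."""
    out = struct.pack(">I", total_len)
    for p in payloads:
        out += struct.pack(">I", len(p)) + p
    return out

def unpack_block(buf):
    """Inverse of pack_block: recover total_len and the payload list."""
    total_len = struct.unpack(">I", buf[:4])[0]
    payloads, pos = [], 4
    while pos < len(buf):
        n = struct.unpack(">I", buf[pos:pos + 4])[0]
        payloads.append(buf[pos + 4:pos + 4 + n])
        pos += 4 + n
    return total_len, payloads

# Placeholder payloads; in a real file these would be snappy-compressed
# and total_len would count the uncompressed bytes of the whole block.
block = pack_block(11, [b"fake-snappy", b"data"])
```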

The decompression algorithm in snappy python library is as follows:

    def decompress(self, data):
        """Decompress 'data', returning a string containing the uncompressed
        data corresponding to at least part of the data in string. This data
        should be concatenated to the output produced by any preceding calls to
        the decompress() method. Some of the input data may be preserved in
        internal buffers for later processing.
        """
        int_size = _INT_SIZE
        self._buf += data
        uncompressed = []
        while True:
            if len(self._buf) < int_size:
                return b"".join(uncompressed)
            next_start = 0
            if not self._block_length:
                self._block_length = unpack_int(self._buf[:int_size])
                self._buf = self._buf[int_size:]
                if len(self._buf) < int_size:
                    return b"".join(uncompressed)
            compressed_length = unpack_int(
                self._buf[next_start:next_start + int_size]
            )
            next_start += int_size
            if len(self._buf) < compressed_length + next_start:
                return b"".join(uncompressed)
            chunk = self._buf[
                next_start:next_start + compressed_length
            ]
            self._buf = self._buf[next_start + compressed_length:]
            uncompressed_chunk = _uncompress(chunk)
            self._uncompressed_length += len(uncompressed_chunk)
            uncompressed.append(uncompressed_chunk)
            if self._uncompressed_length == self._block_length:
                # Here we have uncompressed all subblocks of the current block
                self._uncompressed_length = 0
                self._block_length = 0

The code logic is relatively simple, and the process is as follows:

  • Read block_length and initialize uncompressed_length, which tracks how much data has been decompressed from the current block so far

  • Read compressed_length and compressed_data, decompress the data with the raw snappy codec to get uncompressed_data, append it to the output, and add its length to uncompressed_length

  • When uncompressed_length reaches block_length, the current block is fully processed; reset the internal state and move on to the next block. This repeats until the end of the file

References

c++ snappy lib:

python snappy lib: