Python 3 standard library: base64 encoding binary data in ASCII

Posted by Ne.OnZ on Sun, 12 Apr 2020 05:46:12 +0200

1. base64 encodes binary data in ASCII

The base64 module contains functions that transform binary data into a subset of ASCII suitable for transmission using the plain text protocol. Base64, Base32, Base16, and Base85 encode to convert 8-bit bytes into characters within the range of ASCII printable characters, leaving more bits to represent data, ensuring compatibility with systems that only support ASCII data, such as SMTP. The base value corresponds to the alphabet length used in each encoding. These original encodings also have some "URL safe" variants, which use slightly different alphabets.

1.1 base64 encoding

Encode some text.

import base64

# Load this source file and strip the header.
with open(__file__, 'r', encoding='utf-8') as input:
    raw = input.read()
    initial_data = raw.split('#end_pymotw_header')[1]

byte_string = initial_data.encode('utf-8')
encoded_data = base64.b64encode(byte_string)

num_initial = len(byte_string)

# There will never be more than 2 padding bytes.
padding = 3 - (num_initial % 3)

print('{} bytes before encoding'.format(num_initial))
print('Expect {} padding bytes'.format(padding))
print('{} bytes after encoding\n'.format(len(encoded_data)))
print(encoded_data)

The input must be a byte string, so first encode the Unicode string as UTF-8. The output shows that 185 bytes of UTF-8 source file are expanded to 248 bytes after encoding.

393 bytes before encoding
Expect 3 padding bytes
524 bytes after encoding

b'JylbMV0KCmJ5dGVfc3RyaW5nID0gaW5pdGlhbF9kYXRhLmVuY29kZSgndXRmLTgnKQplbmNvZGVkX2RhdGEgPSBiYXNlNjQuYjY0ZW5jb2RlKGJ5dGVfc3RyaW5nKQoKbnVtX2luaXRpYWwgPSBsZW4oYnl0ZV9zdHJpbmcpCgojIFRoZXJlIHdpbGwgbmV2ZXIgYmUgbW9yZSB0aGFuIDIgcGFkZGluZyBieXRlcy4KcGFkZGluZyA9IDMgLSAobnVtX2luaXRpYWwgJSAzKQoKcHJpbnQoJ3t9IGJ5dGVzIGJlZm9yZSBlbmNvZGluZycuZm9ybWF0KG51bV9pbml0aWFsKSkKcHJpbnQoJ0V4cGVjdCB7fSBwYWRkaW5nIGJ5dGVzJy5mb3JtYXQocGFkZGluZykpCnByaW50KCd7fSBieXRlcyBhZnRlciBlbmNvZGluZ1xuJy5mb3JtYXQobGVuKGVuY29kZWRfZGF0YSkpKQpwcmludChlbmNvZGVkX2RhdGEp'

1.2 base64 decoding

b64decode() converts the encoded string back to its original form. It takes four bytes and uses a lookup table to convert the four bytes back to the original three bytes.

import base64

encoded_data = b'VGhpcyBpcyB0aGUgZGF0YSwgaW4gdGhlIGNsZWFyLg=='
decoded_data = base64.b64decode(encoded_data)
print('Encoded :', encoded_data)
print('Decoded :', decoded_data)

During encoding, each 24 bit sequence (3 bytes) in the input is viewed, and then the 24 bits are encoded into 4 bytes in the output. An equal sign is inserted at the end of the output as a fill, because in this case, the number of digits in the original string cannot be divisible by 24.

The return value of b64decode() is a string of bytes. If the known content is text, the byte string can be converted to a Unicode object. However, because the meaning of using base64 encoding is to be able to transmit binary data, it is not necessarily safe to assume that the decoded value is text.

1.3 variants of URL security

Because the default base64 alphabet may use + and /, which are used in URL s, it is often necessary to replace these characters with a candidate encoding.

import base64

encodes_with_pluses = b'\xfb\xef'
encodes_with_slashes = b'\xff\xff'

for original in [encodes_with_pluses, encodes_with_slashes]:
    print('Original         :', repr(original))
    print('Standard encoding:',
          base64.standard_b64encode(original))
    print('URL-safe encoding:',
          base64.urlsafe_b64encode(original))
    print()

+Replace with -, / with underscore (). In addition, the alphabet is the same.

1.4 other codes

In addition to base64, this module provides functions to process base85, base32, and base16 (HEX) encoded data.

import base64

original_data = b'This is the data, in the clear.'
print('Original:', original_data)

encoded_data = base64.b32encode(original_data)
print('Encoded :', encoded_data)

decoded_data = base64.b32decode(encoded_data)
print('Decoded :', decoded_data)

The base32 alphabet consists of 26 uppercase letters and numbers 2 to 7 in the ASCII set.

The base16 function processes the hexadecimal alphabet.

import base64

original_data = b'This is the data, in the clear.'
print('Original:', original_data)

encoded_data = base64.b16encode(original_data)
print('Encoded :', encoded_data)

decoded_data = base64.b16decode(encoded_data)
print('Decoded :', decoded_data)

Each time the number of encoded bits drops, the output using the encoding format takes up more space.

The base85 function uses an extended alphabet, which is more space efficient than the alphabet used by base64 encoding.

import base64

original_data = b'This is the data, in the clear.'
print('Original    : {} bytes {!r}'.format(
    len(original_data), original_data))

b64_data = base64.b64encode(original_data)
print('b64 Encoded : {} bytes {!r}'.format(
    len(b64_data), b64_data))

b85_data = base64.b85encode(original_data)
print('b85 Encoded : {} bytes {!r}'.format(
    len(b85_data), b85_data))

a85_data = base64.a85encode(original_data)
print('a85 Encoded : {} bytes {!r}'.format(
    len(a85_data), a85_data))

Many Base85 encodings and variations are used in Mercurial, git, and PDF file formats. Python includes two implementations, b85encode() implements the version used in Git Mercurial, and a85encode() implements the ASCII 85 variant used in PDF files.

Topics: Python encoding ascii git

Programmer Think