PHP string Unicode encoding and correct truncation method

Posted by kaitan on Sat, 18 Dec 2021 16:44:41 +0100

When PHP exchanges data with services written in other languages, string encoding problems come up often. For example, I recently developed an RPC service in Go. When PHP calls it as a client, the transmitted object data is serialized with json_encode.

Source string:

> 测试41
> {"json":['test json 测试json串格式'不对'}

The serialized content should be similar to the following:

> \u6d4b\u8bd541\n> {\"json\":['test json \u6d4b\u8bd5json\u4e32\u683c\u5f0f'\u4e0d\u5bf9'}

This serialization looks fine: json_decode easily recovers the original string.
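As a quick check of that round trip, here is a minimal sketch. The source string is reconstructed from the \u escapes in the serialized output above, and the variable names are only illustrative.

```php
<?php
// Source string reconstructed from the \u escapes shown above
// (assumes this file is saved as UTF-8).
$source = "> 测试41\n> {\"json\":['test json 测试json串格式'不对'}";

// By default json_encode() writes every non-ASCII character as \uXXXX.
$encoded = json_encode($source);

// A JSON string literal decodes back to a plain PHP string.
$decoded = json_decode($encoded);

echo $encoded . "\n";
var_dump($decoded === $source);
```

The encoded form is pure ASCII, which is why its byte length differs so much from the byte length of the UTF-8 source.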

However, my service needs to limit the string length: if the string exceeds the limit, it is truncated, similar to the following:

> 测试41
> {"json":['test json 测试...

The first problem is that the truncated string does not have the expected length. If the PHP client truncates the string to the limit before serialization, the serialized form of that truncated string will generally differ in length from the result of serializing the source string first and truncating afterwards.

Assuming we limit the length to 50 single-byte characters, the following script shows the results of truncating before and after serialization:

<?php

$str1 = "> 测试41\n> {\"json\":['test json 测试json串格式'不对'}";

$str2 = json_encode($str1);

echo "str1: ".$str1."\n";
echo "Length of str1: ".strlen($str1)."\n\n";

echo "str2: ".$str2."\n";
echo "Length of str2: ".strlen($str2)."\n\n";

echo "MultiByte Length of str1: ".mb_strlen($str1)."\n\n";

$limit = 50;

$cut1 = substr($str1, 0, $limit);
$cut2 = substr($str2, 0, $limit);
$cut3 = mb_substr($str1, 0, $limit);

echo "cut1: ".$cut1."\n";
echo "Encode cut1: ".json_encode($cut1)."\n";
echo "Length of encoded cut1: ".strlen(json_encode($cut1))."\n\n";

echo "cut2: ".$cut2."\n";
echo "Length of cut2: ".strlen($cut2)."\n\n";

echo "cut3: ".$cut3."\n";
echo "Encode cut3: ".json_encode($cut3)."\n";
echo "Length of encoded cut3: ".strlen(json_encode($cut3))."\n\n";

The running results of the script are as follows:

str1: > 测试41
> {"json":['test json 测试json串格式'不对'}
Length of str1: 61

str2: "> \u6d4b\u8bd541\n> {\"json\":['test json \u6d4b\u8bd5json\u4e32\u683c\u5f0f'\u4e0d\u5bf9'}"
Length of str2: 93

MultiByte Length of str1: 43

cut1: > 测试41
> {"json":['test json 测试json串格�
Encode cut1: 
Length of encoded cut1: 0

cut2: "> \u6d4b\u8bd541\n> {\"json\":['test json \u6d4b\
Length of cut2: 50

cut3: > 测试41
> {"json":['test json 测试json串格式'不对'}
Encode cut3: "> \u6d4b\u8bd541\n> {\"json\":['test json \u6d4b\u8bd5json\u4e32\u683c\u5f0f'\u4e0d\u5bf9'}"
Length of encoded cut3: 93

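You can see that truncating the source string first and then calling json_encode fails: the serialization result is an empty string. substr cuts by bytes, so the cut lands in the middle of a multi-byte UTF-8 character, and json_encode returns false for the resulting invalid UTF-8.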
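The empty output is easy to miss, so it can help to check the return value explicitly. The following is a minimal sketch; is_json_encodable is a hypothetical helper name, not part of the original code, and the source string is reconstructed from the \u escapes in the output above.

```php
<?php
// Assumes this file is saved as UTF-8.
$source = "> 测试41\n> {\"json\":['test json 测试json串格式'不对'}";

// substr() counts bytes, so byte 50 falls inside a 3-byte UTF-8 character.
$cut = substr($source, 0, 50);

$encoded = json_encode($cut);
if ($encoded === false) {
    // json_encode() reports failure by returning false; echoing false prints nothing.
    echo "json_encode failed: " . json_last_error_msg() . "\n";
}

// Hypothetical helper: can this string be JSON-encoded as it stands?
function is_json_encodable(string $s): bool
{
    return json_encode($s) !== false;
}

var_dump(is_json_encodable($cut));
var_dump(is_json_encodable($source));
```

On PHP 7.3 and later, passing JSON_THROW_ON_ERROR as the flags argument makes json_encode throw a JsonException instead of returning false.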

If we instead use the mbstring functions to measure and truncate the string, the result is valid, but the serialized string is longer than our limit.
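The same holds for mb_strcut, the mbstring function that takes a byte length but never splits a character. A minimal sketch, using the source string reconstructed from the \u escapes above:

```php
<?php
// Assumes this file is saved as UTF-8.
$source = "> 测试41\n> {\"json\":['test json 测试json串格式'不对'}";
$limit = 50;

// mb_strcut() works in bytes but backs off to the previous character boundary,
// so the result is valid UTF-8 and at most $limit bytes long.
$cut = mb_strcut($source, 0, $limit, 'UTF-8');

echo "Bytes kept: " . strlen($cut) . "\n";
echo "Valid UTF-8: " . (mb_check_encoding($cut, 'UTF-8') ? "yes" : "no") . "\n";

// Each 3-byte character becomes a 6-byte \uXXXX escape once encoded,
// so the serialized form still overshoots the limit.
echo "Encoded length: " . strlen(json_encode($cut)) . "\n";
```

Cutting the raw string safely solves the encoding failure, but not the length mismatch between the raw and the serialized form.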

Therefore, the correct approach is to serialize the string first and then truncate it to the limit. This ensures that every language processing the string gets consistent results.

There are many methods online for serializing Unicode strings, but after testing I found that once the result is truncated, deserialization may fail because the string contains JSON special symbols such as braces or brackets. Finally, after some experimentation, I implemented the following encoding method using PHP's JSON module:

<?php
//Encode content in UNICODE
function unicode_encode($str){
    $pattern = '/([\[\]\{\}])/i';
    $replacement = '\\\\${1}';
    $str = preg_replace($pattern, $replacement, $str);

    $str = '{"str":"'. $str.'"}';
    $encode = json_encode($str);
    return substr($encode, 12, -4);
}

//Decode the UNICODE encoded content
function unicode_decode($unicode_str){
    //Avoid half unicode truncation
    $cut_pos_check = strrpos($unicode_str, '\u');
    //strrpos() returns false when there is no \u escape at all; only trim when one was found
    if($cut_pos_check !== false && strlen($unicode_str) - $cut_pos_check < 6) { // unicode encoding is 6 bytes: \uxxxx
        $unicode_str = substr($unicode_str, 0, $cut_pos_check);
    }

    //Remove a dangling \ left by truncation: an odd number of trailing backslashes means the last one is unpaired
    $trailing = strlen($unicode_str) - strlen(rtrim($unicode_str, "\\"));
    if($trailing % 2 == 1) {
        $unicode_str = substr($unicode_str, 0, -1);
    }

    $json = '{"str":"'.$unicode_str.'"}';
    $obj = json_decode($json);
    if(empty($obj)) return '';
    return $obj->str;
}

//Truncate string
function cutData($data, $occupied, $limit) {
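    //Note: $this only exists when cutData() is a class method; when calling it as a plain function, pass a non-negative $limit
    if($limit < 0) $limit = intval($this->config->ratchet->msglimit);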

    $msglimit = $limit - $occupied;   //Some lengths that must be occupied are removed
    echo(date("Y-m-d h:i:s")." Msg length: ".strlen($data)."\n");
    echo(date("Y-m-d h:i:s")." Msg limit: ".$msglimit."\n");

    $tmp = preg_replace("/\&nbsp\;/", " ", unicode_encode(trim(htmlspecialchars_decode($data))));    //Encoding as Unicode
    $encoded_len = strlen($tmp);
    echo(date("Y-m-d h:i:s")." Unicode Encoded: ".$tmp."\n\n");
    echo(date("Y-m-d h:i:s")." Unicode Encoded Msg length: ".$encoded_len."\n");

    if($encoded_len > $msglimit) {  //Limit length exceeded
        $sub = substr($tmp, 0, $msglimit);

        echo(date("Y-m-d h:i:s")." Substring: ".$sub."\n\n");
        echo(date("Y-m-d h:i:s")." Unicode Decoded: ".unicode_decode($sub)."\n\n");
        return unicode_decode($sub) . "...";   //Returns the truncated decoded string
    }

    return preg_replace("/\&nbsp\;/", " ", $data);
}
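The decoder above handles a cut that lands inside a single \uXXXX escape. A case it does not appear to cover is characters outside the Basic Multilingual Plane, such as emoji: json_encode writes them as a surrogate pair of two \uXXXX escapes (12 bytes), so a cut can leave a complete high surrogate with no partner, which json_decode rejects. Below is a minimal sketch of one way to handle that. strip_dangling_high_surrogate is a hypothetical helper, not part of the original code, and could be applied to the truncated string before decoding.

```php
<?php
// Hypothetical helper: remove a trailing \uD800..\uDBFF escape (a high surrogate)
// that has lost its low-surrogate partner through truncation.
function strip_dangling_high_surrogate(string $escaped): string
{
    // (?<!\\)((?:\\\\)*) skips an even run of backslashes, so an escaped
    // backslash followed by the literal text "ud83d" is left alone.
    $pattern = '/(?<!\\\\)((?:\\\\\\\\)*)\\\\u[dD][89abAB][0-9a-fA-F]{2}\z/';
    return preg_replace($pattern, '$1', $escaped) ?? $escaped;
}

// "a" followed by U+1F600; strip the surrounding quotes from the JSON literal.
$escaped = substr(json_encode("a\u{1F600}"), 1, -1);   // a\ud83d\ude00
$cut = substr($escaped, 0, 7);                          // a\ud83d

var_dump(json_decode('"' . $cut . '"'));
var_dump(json_decode('"' . strip_dangling_high_surrogate($cut) . '"'));
```

A complete pair ends in a low surrogate (\udc00 to \udfff), which the pattern does not match, so untruncated pairs pass through unchanged.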

Test with another test script:

<?php
$str1 = "> 测试41\n> {\"json\":['test json 测试json串格式'不对'}";

$str2 = unicode_encode($str1);

echo "str1: ".$str1."\n";
echo "Length of str1: ".strlen($str1)."\n\n";

$limit = 50;

$cut = cutData($str1, 0, $limit);

echo "cut: ".$cut."\n";

The results are similar to the following:

str1: > 测试41
> {"json":['test json 测试json串格式'不对'}
Length of str1: 61

2021-08-28 11:18:26 Msg length: 61
2021-08-28 11:18:26 Unicode Encoded: > \u6d4b\u8bd541\n> \\{\"json\":\\['test json \u6d4b\u8bd5json\u4e32\u683c\u5f0f'\u4e0d\u5bf9'\\}

2021-08-28 11:18:26 Unicode Encoded Msg length: 97
2021-08-28 11:18:26 Msg limit: 50
2021-08-28 11:18:26 Substring: > \u6d4b\u8bd541\n> \\{\"json\":\\['test json \u6d

2021-08-28 11:18:26 Unicode Decoded: > 测试41
> \{"json":\['test json 

cut: > 测试41
> \{"json":\['test json ...

Topics: PHP Back-end unicode