protobuf parsing error: garbled strings in communication between Windows and Linux

Posted by Gruzin on Tue, 01 Feb 2022 01:16:34 +0100

Symptom

A program on Windows communicates with a program on Linux. Specifically, the Windows program reads a local file and sends the data to the Linux process.

The Linux process encodes the structure with protobuf, writes it to redis, and reads it back for display when someone requests it.

The problem: encoding and writing succeed, but protobuf parsing fails at query time. At first I could not see why.

Diagnosis and solution

Tracing showed that the structure contains a string field. When the value read by the Windows process is English text, the Linux process displays it correctly.

When the value is Chinese, the Linux process log shows the field as garbled text.

From that description the cause is easy to pin down: Windows and Linux use different character encodings. protobuf also expects string fields to contain UTF-8 (or 7-bit ASCII) text, which is consistent with parsing failing on GBK bytes.

Generally speaking, Chinese-language Windows uses GBK, while Linux uses UTF-8.

Therefore, the solution is to transcode the field from GBK to UTF-8 in the Windows process after reading it. The example code is as follows:

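#include <codecvt>
#include <cwchar>
#include <locale>
#include <string>

#ifdef _MSC_VER
	const std::string kGbkLocaleName = ".936";       // GBK code page on Windows
#else
	const std::string kGbkLocaleName = "zh_CN.GBK";  // this locale must be installed
#endif

// The standard gives std::codecvt_byname a protected destructor. This wrapper
// makes it public so that std::wstring_convert can own and delete the facet.
struct GbkFacet : std::codecvt_byname<wchar_t, char, std::mbstate_t>
{
    GbkFacet() : std::codecvt_byname<wchar_t, char, std::mbstate_t>(kGbkLocaleName) {}
    ~GbkFacet() {}
};

std::string CatsMgr::Gbk2Utf8(const std::string& str)
{
    // Step 1: GBK bytes -> wide string.
    std::wstring_convert<GbkFacet> convert;
    std::wstring tmp_wstr = convert.from_bytes(str);

    // Step 2: wide string -> UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> cv2;
    return cv2.to_bytes(tmp_wstr);
}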

After this transcoding, the Linux process displays Chinese normally.
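
The conversion above throws when something goes wrong, so here is a minimal sketch of a non-throwing variant. The names TryGbkToUtf8 and OwnedByNameFacet are mine, not from the original program, and the sketch assumes the GBK locale is named ".936" with MSVC and "zh_CN.GBK" elsewhere. It returns false both when the locale is not available (std::runtime_error) and when the input is not valid GBK (std::range_error, which derives from std::runtime_error).

```cpp
#include <codecvt>
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

namespace {
typedef std::codecvt_byname<wchar_t, char, std::mbstate_t> ByNameFacet;

// Hypothetical helper: exposes a public destructor so that
// std::wstring_convert is allowed to delete the facet it owns.
struct OwnedByNameFacet : ByNameFacet {
    explicit OwnedByNameFacet(const char* name) : ByNameFacet(name) {}
    ~OwnedByNameFacet() {}
};
}  // namespace

// Converts GBK bytes to UTF-8. On failure returns false and leaves `utf8` untouched.
bool TryGbkToUtf8(const std::string& gbk, std::string& utf8)
{
#ifdef _MSC_VER
    const char* const kName = ".936";       // assumed GBK locale name on Windows
#else
    const char* const kName = "zh_CN.GBK";  // assumed GBK locale name on Linux
#endif
    try {
        std::wstring_convert<OwnedByNameFacet> from_gbk(new OwnedByNameFacet(kName));
        std::wstring_convert<std::codecvt_utf8<wchar_t> > to_utf8;
        utf8 = to_utf8.to_bytes(from_gbk.from_bytes(gbk));
        return true;
    } catch (const std::runtime_error&) {
        // Missing locale or bytes that are not valid in the source encoding.
        return false;
    }
}
```

A caller can then decide what to do with a field that cannot be converted (skip it, log it, or send a placeholder) instead of letting the exception escape.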

Further notes

In cross-platform inter-process communication, inconsistent encodings on the two sides often lead to garbled text that cannot be displayed correctly.
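
One way to catch such a mismatch early is to check incoming strings at the boundary, before they are written into a protobuf string field. The following is only a sketch of a UTF-8 well-formedness check; the function name IsValidUtf8 is hypothetical and not part of the original program.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects stray or missing
// continuation bytes, overlong encodings, UTF-16 surrogates, and code
// points above U+10FFFF.
bool IsValidUtf8(const std::string& s)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const unsigned long kMin[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = p[i];
        std::size_t len = 0;   // total bytes in this sequence
        unsigned long cp = 0;  // decoded code point
        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;           // continuation byte or invalid lead byte

        if (i + len > n) return false;  // sequence is cut off
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (cp < kMin[len]) return false;                 // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // surrogate
        if (cp > 0x10FFFF) return false;                  // out of range
        i += len;
    }
    return true;
}
```

For example, the two Chinese characters in "中文" are the bytes D6 D0 CE C4 in GBK and E4 B8 AD E6 96 87 in UTF-8. The check accepts the second sequence and rejects the first, because D0 is not a valid continuation byte after D6.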

C++11 provides character encoding conversion facilities that cover the conversions needed in everyday development. The main tool is std::wstring_convert combined with a codecvt facet. Note that std::wstring_convert and the <codecvt> header were deprecated in C++17, so newer compilers may emit deprecation warnings.

The included header files are:

#include <locale>
#include <codecvt>

Note:

On Windows, wchar_t is 16 bits wide, so std::wstring holds UTF-16 code units (comparable to std::u16string). The console code page on Chinese-language Windows is generally GBK (code page 936).
On Linux, wchar_t is 32 bits wide, so std::wstring holds UTF-32 code points (comparable to std::u32string).
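
The difference in wchar_t width can be made visible with a small sketch. The helper names WideEncodingName and WideUnitsFor are hypothetical, and the encoding is inferred only from sizeof(wchar_t), which is an assumption that holds on mainstream Windows and Linux toolchains.

```cpp
#include <cstddef>
#include <string>

// Encoding that wchar_t strings conventionally use on this platform,
// inferred only from the width of wchar_t.
inline const char* WideEncodingName()
{
    return sizeof(wchar_t) == 2 ? "UTF-16" : "UTF-32";
}

// Number of wchar_t units needed to store one Unicode code point here:
// code points above U+FFFF need a surrogate pair when wchar_t is 16 bits.
inline std::size_t WideUnitsFor(unsigned long code_point)
{
    return (sizeof(wchar_t) == 2 && code_point > 0xFFFF) ? 2 : 1;
}
```

Every character that GBK can encode lies at or below U+FFFF, so it occupies a single wchar_t on both platforms. The difference only shows up for characters above U+FFFF, such as emoji.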

Several common transformations are provided below.

GBK to UTF-8:

// First convert GBK to std::wstring, then convert the std::wstring to UTF-8.
std::string gbk_to_utf8(const std::string& str)
{
	// GBK locale name on Windows
	const char* GBK_LOCALE_NAME = ".936";
	std::wstring_convert<std::codecvt_byname<wchar_t, char, mbstate_t>> convert(new std::codecvt_byname<wchar_t, char, mbstate_t>(GBK_LOCALE_NAME));
	std::wstring tmp_wstr = convert.from_bytes(str);

	std::wstring_convert<std::codecvt_utf8<wchar_t>> cv2;
	return cv2.to_bytes(tmp_wstr);
}

UTF-8 to GBK:

// First convert UTF-8 to std::wstring, then convert the std::wstring to GBK.
std::string utf8_to_gbk(const std::string& str)
{
	std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
	std::wstring tmp_wstr = conv.from_bytes(str);

	// GBK locale name on Windows
	const char* GBK_LOCALE_NAME = ".936";
	std::wstring_convert<std::codecvt_byname<wchar_t, char, mbstate_t>> convert(new std::codecvt_byname<wchar_t, char, mbstate_t>(GBK_LOCALE_NAME));
	return convert.to_bytes(tmp_wstr);
}

GBK to std::wstring:

std::wstring gbk_to_wstr(const std::string& str)
{
	// GBK locale name on Windows
	const char* GBK_LOCALE_NAME = ".936";
	std::wstring_convert<std::codecvt_byname<wchar_t, char, mbstate_t>> convert(new std::codecvt_byname<wchar_t, char, mbstate_t>(GBK_LOCALE_NAME));
	return convert.from_bytes(str);
}

std::string to std::wstring (UTF-8 --> wchar_t):

// The std::string must be UTF-8 encoded; otherwise std::range_error is thrown.
std::wstring utf8_to_wstr(const std::string& src)
{
	std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
	return converter.from_bytes(src);
}
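
If the input comes from another process, it may be safer not to let that exception propagate. A minimal sketch, with the hypothetical name TryUtf8ToWide, that reports failure through its return value instead:

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Converts UTF-8 to std::wstring. Returns false (leaving `out` untouched)
// when `src` is not valid UTF-8, instead of throwing std::range_error.
bool TryUtf8ToWide(const std::string& src, std::wstring& out)
{
    try {
        std::wstring_convert<std::codecvt_utf8<wchar_t> > converter;
        out = converter.from_bytes(src);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

Passing GBK bytes such as D6 D0 CE C4 to this function returns false, which is a quick way to notice that a string has not been transcoded yet.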

std::wstring to std::string (wchar_t --> UTF-8):

// The resulting std::string is UTF-8. Printed directly to a Windows console it appears garbled; convert it to GBK first to display Chinese correctly there.
std::string wstr_to_utf8(const std::wstring& src)
{
	std::wstring_convert<std::codecvt_utf8<wchar_t>> convert;
	return convert.to_bytes(src);
}
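
One caveat with the two UTF-8 helpers above: as far as I understand, std::codecvt_utf8<wchar_t> treats a 16-bit wchar_t as UCS-2, so on Windows it cannot handle characters above U+FFFF, while std::codecvt_utf8_utf16<wchar_t> produces proper UTF-16. This does not matter for GBK text, but the sketch below picks the facet by the width of wchar_t in case other text passes through. All names in it (WideUtf8Facet, PortableWideFacet, Utf8ToWidePortable, WideToUtf8Portable) are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Facet selection by wchar_t width: plain codecvt_utf8 for 32-bit wchar_t,
// the UTF-16 aware facet for 16-bit wchar_t.
template <bool kWideIs16Bit>
struct WideUtf8Facet { typedef std::codecvt_utf8<wchar_t> type; };

template <>
struct WideUtf8Facet<true> { typedef std::codecvt_utf8_utf16<wchar_t> type; };

typedef WideUtf8Facet<sizeof(wchar_t) == 2>::type PortableWideFacet;

// UTF-8 -> std::wstring; throws std::range_error on invalid input.
std::wstring Utf8ToWidePortable(const std::string& src)
{
    std::wstring_convert<PortableWideFacet> converter;
    return converter.from_bytes(src);
}

// std::wstring -> UTF-8; throws std::range_error on invalid input.
std::string WideToUtf8Portable(const std::wstring& src)
{
    std::wstring_convert<PortableWideFacet> converter;
    return converter.to_bytes(src);
}
```

With this selection, a round trip through std::wstring should preserve a 4-byte UTF-8 sequence such as F0 9F 98 80 (U+1F600) on both platforms.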

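Note that the locale name for GBK under Linux may be "zh_CN.GBK" (and that locale has to be installed), while under Windows it is ".936". Cross-platform code therefore still has to adapt to each system. In addition, on implementations where the destructor of std::codecvt_byname is protected, as the standard specifies and as is the case in GCC's libstdc++, a small derived class with a public destructor is needed before std::wstring_convert can own the facet.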
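
Since the locale name differs between systems and the locale may not be installed at all, it can help to probe for it at startup. The sketch below is an assumption-laden illustration: LocaleIsAvailable and FirstAvailableLocale are hypothetical names, and it assumes that a name accepted by std::locale is also accepted by std::codecvt_byname, which I believe holds on common implementations but have not verified everywhere.

```cpp
#include <cstddef>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns true if a std::locale can be constructed from `name`.
// std::locale throws std::runtime_error for unknown locale names.
bool LocaleIsAvailable(const std::string& name)
{
    try {
        std::locale probe(name.c_str());
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Returns the first usable name from `candidates`, or an empty string
// if none of them is available on this system.
std::string FirstAvailableLocale(const std::vector<std::string>& candidates)
{
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (LocaleIsAvailable(candidates[i])) return candidates[i];
    }
    return std::string();
}
```

A program could call FirstAvailableLocale with ".936" and "zh_CN.GBK" as candidates and report a clear error when the result is empty, rather than failing later inside a conversion.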

Summary

When text shows up garbled, don't panic, and don't ignore it either: you will have to deal with it sooner or later.

So when you run into garbled text, analyze and fix it right away. Often a few lines of conversion code are all it takes.

References

https://blog.csdn.net/qq_31175231/article/details/83865059
