RUST learning diary Lesson 16 - common methods of string

Posted by opido on Thu, 06 Jan 2022 02:16:10 +0100

RUST learning diary Lesson 16 - common methods of string (2)

0x00 review and opening

The previous lesson introduced the common methods for modifying Rust strings. This lesson covers the methods for accessing them. This is the fourth article on Rust strings. If time allows, I will cover other aspects of strings in more detail later.

0x01 Unicode and UTF-8

The most common character encoding in computing is probably ASCII, but ASCII only covers the range 0x00~0x7F, so it cannot represent Chinese characters, the scripts of minority languages, and so on. Various encodings such as GB2312 and GB18030 appeared as a result. To unify character encoding, the International Organization for Standardization defined a universal multi-byte coded character set, the Unicode character set, which aims to cover the characters of all the world's languages. The ranges usable for characters are 0x0000~0xD7FF and 0xE000~0x10FFFF (0xD800~0xDFFF is reserved for surrogates).

However, storing every Unicode character at a fixed width takes 4 bytes per character. To save space, a variable-length encoding is used instead, and UTF-8 is a compact and widely used one. In Rust, both the String and &str types hold text encoded as UTF-8. UTF-8 is a variable-length encoding whose code unit is 1 byte. It encodes each code point into 1 ~ 4 bytes according to fixed rules, as shown in the following table:

UTF-8 encoding (1 ~ 4 bytes)	Code point representation	Unicode range
0xxxxxxx	0bxxxxxxx	0x00~0x7f
110xxxxx 10aaaaaa	0bxxxxxaaaaaa	0x80~0x7ff
1110xxxx 10aaaaaa 10bbbbbb	0bxxxxaaaaaabbbbbb	0x800~0xffff
11110xxx 10aaaaaa 10bbbbbb 10cccccc	0bxxxaaaaaabbbbbbcccccc	0x10000~0x10ffff

This explains why a Chinese character usually occupies 3 bytes, while English letters and Arabic numerals occupy 1 byte.
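The bit layout in the table can be turned into code directly. The following is only a minimal sketch: `encode_utf8_by_hand` is a hypothetical helper written for illustration (real code should simply use the standard library), and it is checked against the standard `char::encode_utf8`.

```rust
/// Hypothetical helper for illustration: encodes one character into UTF-8
/// by following the bit layout of the table (1 to 4 bytes).
fn encode_utf8_by_hand(c: char) -> Vec<u8> {
    let cp = c as u32; // the Unicode code point
    match cp {
        // 0xxxxxxx
        0x0000..=0x007F => vec![cp as u8],
        // 110xxxxx 10aaaaaa
        0x0080..=0x07FF => vec![
            0b1100_0000 | (cp >> 6) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 1110xxxx 10aaaaaa 10bbbbbb
        0x0800..=0xFFFF => vec![
            0b1110_0000 | (cp >> 12) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
        // 11110xxx 10aaaaaa 10bbbbbb 10cccccc
        _ => vec![
            0b1111_0000 | (cp >> 18) as u8,
            0b1000_0000 | ((cp >> 12) & 0x3F) as u8,
            0b1000_0000 | ((cp >> 6) & 0x3F) as u8,
            0b1000_0000 | (cp & 0x3F) as u8,
        ],
    }
}

fn main() {
    for c in ['a', '©', '汉', '😃'] {
        let by_hand = encode_utf8_by_hand(c);
        // Compare with the standard library's encoder.
        let mut buf = [0u8; 4];
        let std_bytes = c.encode_utf8(&mut buf).as_bytes();
        assert_eq!(by_hand, std_bytes);
        println!("U+{:04X} -> {} byte(s): {:02X?}", c as u32, by_hand.len(), by_hand);
    }
}
```

The leading byte carries the high bits of the code point and announces the length, and every following byte starts with 10 and carries 6 more bits.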

Examples of UTF-8 codes are as follows:

UTF-8 encoding (1 ~ 4 bytes)	Character	Code point
01100001	a	0b1100001 == 0x61
11000010_10101001	©	0b00010_101001 == 0xa9
11100110_10110001_10001001	汉	0b0110_110001_001001 == 0x6c49
11110000_10011111_10011000_10000011	😃	0b000_011111_011000_000011 == 0x1f603

The example code is as follows:

fn main() {
    println!("***************1,code*****************");
    let a = "a";
    let b = "©";
    let c = "汉";
    let d = "😃";

    // size_of_val on a &str reports the byte length of the string data
    println!("a occupies {} byte(s)", std::mem::size_of_val(a));
    println!("b occupies {} byte(s)", std::mem::size_of_val(b));
    println!("c occupies {} byte(s)", std::mem::size_of_val(c));
    println!("d occupies {} byte(s)", std::mem::size_of_val(d));

    println!("\n***************1,code(Print binary)*****************");
    // Print every UTF-8 byte of each string in binary
    for x in a.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in b.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in c.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in d.bytes() {
        print!("{:08b}_", x);
    }

    println!("\n***************1,code(Print Unicode)*****************");
    // Casting a char to u32 yields its Unicode code point
    println!("{:X}", 'a' as u32);
    println!("{:X}", '©' as u32);
    println!("{:X}", '汉' as u32);
    println!("{:X}", '😃' as u32);
}

Code run result:

***************1,code*****************
a occupies 1 byte(s)
b occupies 2 byte(s)
c occupies 3 byte(s)
d occupies 4 byte(s)

***************1,code(Print binary)*****************
01100001_
11000010_10101001_
11100110_10110001_10001001_
11110000_10011111_10011000_10000011_
***************1,code(Print Unicode)*****************
61
A9
6C49
1F603

The encoding and decoding rules will not be covered in detail here. Those interested can look them up. If there are enough requests in the comments, I will come back and write an extra chapter on encoding and decoding.

0x02 access to string

In Rust, you should pay attention to the following two points when accessing strings:

1. Because a string is a UTF-8 encoded byte sequence and the encoding is variable length, an integer index cannot be used directly to access a character.

2. Strings can be processed in two ways: by byte or by character. The bytes() method returns an iterator over the bytes, and the chars() method returns an iterator over the characters.
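The two points above can be sketched as follows. This is only an illustration: `byte_and_char_views` is a hypothetical helper, and the sample string "汉a" is chosen just to have one 3-byte and one 1-byte character. It uses the standard `str::is_char_boundary` and `str::get` methods to show why byte offsets cannot be treated as character indexes.

```rust
/// Hypothetical helper for illustration: the same string seen as bytes and as characters.
fn byte_and_char_views(s: &str) -> (Vec<u8>, Vec<char>) {
    (s.bytes().collect(), s.chars().collect())
}

fn main() {
    let s = "汉a";

    // `s[0]` does not compile: a str cannot be indexed by an integer.
    // Range slicing works on byte offsets and must land on character boundaries.
    assert!(s.is_char_boundary(0));
    assert!(!s.is_char_boundary(1)); // inside the 3-byte encoding of '汉'
    assert!(s.is_char_boundary(3));

    // `get` returns None instead of panicking when a range splits a character.
    assert_eq!(s.get(0..3), Some("汉"));
    assert_eq!(s.get(0..1), None);

    // Byte-wise and character-wise views of the same string.
    let (bytes, chars) = byte_and_char_views(s);
    assert_eq!(bytes, [0xE6u8, 0xB1, 0x89, 0x61]);
    assert_eq!(chars, ['汉', 'a']);
    println!("{:02X?} / {:?}", bytes, chars);
}
```

Slicing with `&s[0..1]` would panic at run time for this string, which is why the sketch uses `get`.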

Length of string

The len() method returns the length in bytes, that is, the total number of bytes of all characters in the string. The length obtained through chars().count() is the number of characters, which is what we usually mean by the length of a string.
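As a small sketch of the relationship between the two lengths (the helper `lengths` and the sample string are assumptions made for illustration), summing the UTF-8 width of every character with the standard `char::len_utf8` gives the byte length again:

```rust
/// Hypothetical helper for illustration: returns (byte length, character count).
fn lengths(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    let s = "a©汉😃";
    let (bytes, chars) = lengths(s);
    // 1 + 2 + 3 + 4 bytes, but only 4 characters
    assert_eq!((bytes, chars), (10, 4));

    // Summing each character's UTF-8 width reproduces the byte length.
    let summed: usize = s.chars().map(char::len_utf8).sum();
    assert_eq!(summed, s.len());

    // char_indices pairs every character with its starting byte offset.
    for (offset, c) in s.char_indices() {
        println!("byte offset {:2}: {:?} ({} byte(s))", offset, c, c.len_utf8());
    }
}
```

Note that len() is a constant-time lookup, while chars().count() has to walk the whole string.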

The example code is as follows:

let string_length = "I am learning Rust~";
println!("\"{}\"Byte length of : {}", string_length, string_length.len());
println!("\"{}\"Character length of : {}", string_length, string_length.chars().count());

Code run result:

"I am learning Rust~"Byte length of : 20
"I am learning Rust~"Character length of : 10

Access string elements

Since Rust strings are UTF-8 encoded, a single character cannot be accessed directly by index, so we have to go through an iterator instead (iterators will be introduced in later chapters). The bytes() and chars() methods return byte and character iterators respectively. The iterator's nth method accesses an element by its zero-based position and returns an Option.

The example code is as follows:

let string_nth = "Rust Fundamentals of programming";

// Access the 5th character
dbg!(string_nth.chars().nth(5));
// Access the 5th byte
dbg!(string_nth.bytes().nth(5));

Code run result:

[src\main.rs:45] string_nth.chars().nth(5) = Some(
    'Course',
)
[src\main.rs:47] string_nth.bytes().nth(5) = Some(
    188,
)

0x03 summary

It took four articles to give a basic introduction to Rust strings. To really master the string methods, you still need plenty of practice. There is a lot more to Rust strings and their related methods, but for reasons of space the coverage of strings ends here.

0x04 source code of this section

016. StudyRust - Code cloud - Open Source China (gitee.com)

0x05 references

Unified code coding range of all Unicode sections | Unicode symbol library ✏️ (fuhaoku.net)

Preview of the next lesson: control flow.
