Rust learning diary, Lesson 16: common string methods (2)
0x00 review and opening
The previous lesson covered the common methods for modifying Rust strings. This lesson covers the methods for accessing them. It is the fourth article on Rust strings; if time allows, I will cover other aspects of strings in more detail later.
0x01 Unicode and UTF-8
ASCII is probably the most familiar character encoding, but it only covers the range 0x00~0x7F, so it cannot represent Chinese characters, the scripts of minority languages, and so on. This led to a variety of encodings such as GB2312 and GB18030. To unify character encoding, international standards bodies defined a universal multi-byte coded character set, the Unicode character set, which aims to cover the characters of all the world's writing systems. Valid Unicode scalar values (the values a Rust char can hold) lie in the ranges 0x0000~0xD7FF and 0xE000~0x10FFFF; the gap between them is reserved for UTF-16 surrogates.
However, storing every Unicode code point at a fixed width takes 4 bytes per character (as UTF-32 does, and as Rust's char type does). To save space, text is usually stored in a variable-length encoding, and UTF-8 is the most widely used one. In Rust, both String and &str hold text encoded as UTF-8. UTF-8 is a variable-length encoding whose code unit is 1 byte: it encodes each code point into 1 ~ 4 bytes according to the rules in the following table:
UTF-8 encoding (1 ~ 4 bytes) | Code point representation | Unicode range |
---|---|---|
0xxxxxxx | 0bxxxxxxx | 0x00~0x7f |
110xxxxx 10aaaaaa | 0bxxxxxaaaaaa | 0x80~0x7ff |
1110xxxx 10aaaaaa 10bbbbbb | 0bxxxxaaaaaabbbbbb | 0x800~0xffff |
11110xxx 10aaaaaa 10bbbbbb 10cccccc | 0bxxxaaaaaabbbbbbcccccc | 0x10000~0x10ffff |
This explains why a common Chinese character occupies 3 bytes (its code point falls in the 0x800~0xffff range), while English letters and Arabic numerals occupy 1 byte.
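The bit layout in the table can be turned into code fairly directly. The following is only a minimal sketch: `encode_utf8_manual` is a hypothetical helper written for illustration, not something from the standard library, and it is checked against the standard `char::encode_utf8` method.

```rust
/// Hypothetical helper: encode one char by hand, following the table's bit layout.
fn encode_utf8_manual(c: char) -> Vec<u8> {
    let cp = c as u32; // the code point
    if cp < 0x80 {
        // 0xxxxxxx
        vec![cp as u8]
    } else if cp < 0x800 {
        // 110xxxxx 10aaaaaa
        vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8]
    } else if cp < 0x10000 {
        // 1110xxxx 10aaaaaa 10bbbbbb
        vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]
    } else {
        // 11110xxx 10aaaaaa 10bbbbbb 10cccccc
        vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]
    }
}

fn main() {
    for c in ['a', '©', '汉', '😃'] {
        let manual = encode_utf8_manual(c);
        // Compare with the standard library's encoder.
        let mut buf = [0u8; 4];
        let expected: &str = c.encode_utf8(&mut buf);
        assert_eq!(manual.as_slice(), expected.as_bytes());
        println!("U+{:04X} -> {:02X?}", c as u32, manual);
    }
}
```

In real code there is no reason to write this by hand; the sketch is only meant to make the table concrete.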
Examples of UTF-8 codes are as follows:
UTF-8 encoding (1 ~ 4 bytes) | character | Code point |
---|---|---|
01100001 | a | 0b1100001 == 0x61 |
11000010_10101001 | © | 0b00010_101001 == 0xa9 |
11100110_10110001_10001001 | 汉 | 0b0110_110001_001001 == 0x6c49 |
11110000_10011111_10011000_10000011 | 😃 | 0b000_011111_011000_000011 == 0x1f603 |
The example code is as follows:

    println!("***************1,code*****************");
    let a = "a";
    let b = "©";
    let c = "汉";
    let d = "😃";
    // For a &str, size_of_val returns the size of the underlying str,
    // which is its length in UTF-8 bytes.
    println!("a occupies {} byte(s)", std::mem::size_of_val(a));
    println!("b occupies {} byte(s)", std::mem::size_of_val(b));
    println!("c occupies {} byte(s)", std::mem::size_of_val(c));
    println!("d occupies {} byte(s)", std::mem::size_of_val(d));

    println!("\n***************1,code(Print binary)*****************");
    // bytes() iterates over the raw UTF-8 bytes of each string.
    for x in a.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in b.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in c.bytes() {
        print!("{:08b}_", x);
    }
    println!();
    for x in d.bytes() {
        print!("{:08b}_", x);
    }

    println!("\n***************1,code(Print Unicode)*****************");
    // Casting a char to u32 yields its code point.
    println!("{:X}", 'a' as u32);
    println!("{:X}", '©' as u32);
    println!("{:X}", '汉' as u32);
    println!("{:X}", '😃' as u32);

Code run result:

    ***************1,code*****************
    a occupies 1 byte(s)
    b occupies 2 byte(s)
    c occupies 3 byte(s)
    d occupies 4 byte(s)

    ***************1,code(Print binary)*****************
    01100001_
    11000010_10101001_
    11100110_10110001_10001001_
    11110000_10011111_10011000_10000011_
    ***************1,code(Print Unicode)*****************
    61
    A9
    6C49
    1F603

The encoding and decoding rules are not covered further here; interested readers can look them up. If there are enough requests in the comments, I will come back and add a chapter that explains encoding and decoding.
0x02 access to string
In Rust, you should pay attention to the following two points when accessing strings:
1. Because a string is a UTF-8 encoded byte sequence and the encoding is variable-length, characters cannot be accessed directly by index.
2. Strings can be processed in two ways: by byte or by character. The bytes() method returns an iterator over the bytes; the chars() method returns an iterator over the characters.
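The two points above can be sketched as follows. This is an illustration under my own assumptions: the sample string and the helper `first_n_chars` are hypothetical, while `bytes()`, `chars()`, `char_indices()`, `is_char_boundary()` and `get()` are standard `str` methods.

```rust
/// Hypothetical helper: return the first `n` characters of `s` as a slice,
/// cutting only on a character boundary.
fn first_n_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // byte_offset is where character number n starts, so it is a valid cut point
        Some((byte_offset, _)) => &s[..byte_offset],
        // the string has n characters or fewer: return all of it
        None => s,
    }
}

fn main() {
    let s = "汉a😃"; // 3 + 1 + 4 bytes

    // let first = s[0]; // does not compile: a str cannot be indexed by an integer

    // Processing by byte and by character.
    let bytes: Vec<u8> = s.bytes().collect();
    let chars: Vec<char> = s.chars().collect();
    println!("{} bytes, {} chars", bytes.len(), chars.len());

    // Range slicing uses byte offsets and must land on character boundaries.
    assert!(s.is_char_boundary(3));
    assert!(!s.is_char_boundary(1));
    assert_eq!(s.get(0..1), None); // would cut 汉 in half
    assert_eq!(s.get(0..3), Some("汉"));

    assert_eq!(first_n_chars(s, 2), "汉a");
}
```

Slicing with `&s[0..1]` instead of `s.get(0..1)` would panic at run time here, which is why the checked `get` is used in the sketch.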
Length of string
The len() method returns the length in bytes, that is, the total number of bytes of all characters in the string. chars().count() returns the number of characters, which is what we usually mean by the length of a string. Note that chars().count() has to walk the whole string, whereas len() is a constant-time lookup.
The example code is as follows:
let string_length = "I am learning Rust~"; println!("\"{}\"Byte length of : {}", string_length, string_length.len()); println!("\"{}\"Character length of : {}", string_length, string_length.chars().count());
Code run result:
"I am learning Rust~"Byte length of : 20 "I am learning Rust~"Character length of : 10
Access string elements
Because Rust strings are UTF-8 encoded, a single character cannot be accessed directly by index; we have to go through an iterator instead (iterators are covered in later chapters). The bytes() and chars() methods return byte and character iterators respectively, and the iterator's nth method lets us access an element by its zero-based index. It returns an Option, which is None when the index is out of range.
The example code is as follows:
let string_nth = "Rust Fundamentals of programming"; // Access the 5th character dbg!(string_nth.chars().nth(5)); // Access the 5th byte dbg!(string_nth.bytes().nth(5));
Code run result:
[src\main.rs:45] string_nth.chars().nth(5) = Some( 'Course', ) [src\main.rs:47] string_nth.bytes().nth(5) = Some( 188, )
0x03 summary
It has taken four articles to give a basic introduction to Rust strings. To really master the string methods, you still need plenty of practice. There is much more to know about Rust strings and their methods, but space is limited, so the discussion of strings ends here for now.
0x04 references
Unified code coding range of all Unicode sections | Unicode symbol library ✏️ (fuhaoku.net)
0x05 source code of this section
016. StudyRust - Code cloud - Open Source China (gitee.com)
Preview of the next lesson: control flow.