Rust language Bible 12 - strings and slices

Posted by spxmgb on Tue, 07 Dec 2021 19:31:57 +0100

Original link: https://course.rs/basic/string-slice.html
 

Strings

In other languages, strings are often treated as a minor topic because they seem so simple. For example, "hello, world" is almost the entire content of the string chapter, right? If you come to Rust with that mindset, I promise you will stumble. So take this chapter seriously and read it carefully. Much of its content is not covered in other Rust books.

Let's start with a very simple code:

fn main() {
  let my_name = "Pascal";
  greet(my_name);
}

fn greet(name: String) {
  println!("Hello, {}!", name);
}

The greet function accepts a name parameter of type String and prints it to the terminal. It looks easy to understand, but can you guess whether this code compiles?

error[E0308]: mismatched types
 --> src/main.rs:3:11
  |
3 |     greet(my_name);
  |           ^^^^^^^
  |           |
  |           expected struct `std::string::String`, found `&str`
  |           help: try using a conversion method: `my_name.to_string()`

error: aborting due to previous error

Bingo, sure enough, it reports an error. The compiler says that the greet function expects a string of type String, but a string of type &str was passed in. I am sure readers are cursing under their breath by now: how can strings cause this much trouble?
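As a minimal sketch of how the error above could be resolved, there are two common options: convert the literal at the call site, as the compiler's help message suggests, or change the function to accept a string slice. The names greet_owned and greet_borrowed are hypothetical and only serve to show both variants side by side.

```rust
// Option 1 (assumed fix): keep an owned String parameter and convert at the call site.
fn greet_owned(name: String) -> String {
    format!("Hello, {}!", name)
}

// Option 2 (assumed fix): accept a string slice, so literals can be passed directly.
fn greet_borrowed(name: &str) -> String {
    format!("Hello, {}!", name)
}

fn main() {
    let my_name = "Pascal";
    // to_string() creates an owned String from the &str literal.
    println!("{}", greet_owned(my_name.to_string()));
    // No conversion is needed when the parameter is &str.
    println!("{}", greet_borrowed(my_name));
}
```

The second variant is usually the more flexible one, because a &str parameter accepts both literals and borrowed Strings.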

Before we talk about strings, let's first look at what a slice is.

Slices

Slices are not unique to Rust. They are also very popular in Go. A slice lets you refer to a contiguous sequence of elements in a collection instead of the whole collection.

For strings, a slice is a reference to part of a String. It looks like this:

let s = String::from("hello world");

let hello = &s[0..5];
let world = &s[6..11];

hello does not refer to the whole String s, but to the part of s specified by [0..5].

This is the syntax for creating slices: a range inside square brackets, [start_index..end_index], where the start index is the position of the first element in the slice and the end index is the position just after the last element. In other words, it is a half-open interval that excludes its right end. Internally, the slice stores the start position and the length of the slice, where the length is the end index minus the start index.

Take let world = &s[6..11]; as an example: world is a slice whose pointer points to the 7th byte of s (indices start at 0, so index 6 is the 7th byte), and whose length is 5 bytes.
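The pointer-plus-length description can be checked with a small sketch. The helper offset_and_len is hypothetical and assumes that its second argument really is a slice of the first.

```rust
// Hypothetical helper: returns (byte offset of `part` inside `whole`, length of `part`).
// Assumes `part` was created by slicing `whole`.
fn offset_and_len(whole: &str, part: &str) -> (usize, usize) {
    // Compare the two start addresses to recover the slice's starting byte index.
    let offset = part.as_ptr() as usize - whole.as_ptr() as usize;
    (offset, part.len())
}

fn main() {
    let s = String::from("hello world");
    let world = &s[6..11];
    let (offset, len) = offset_and_len(&s, world);
    // The slice starts at byte index 6 and is 11 - 6 = 5 bytes long.
    println!("offset = {}, len = {}", offset, len);
}
```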

Figure: a String slice references a part of another String

When using Rust's .. range syntax, if you want to start at index 0, the following two forms are equivalent:

let s = String::from("hello");

let slice = &s[0..2];
let slice = &s[..2];

Similarly, if you want the slice to include the last byte of the String, you can write:

let s = String::from("hello");

let len = s.len();

let slice = &s[3..len];
let slice = &s[3..];

You can also take a slice of the entire String:

let s = String::from("hello");

let len = s.len();

let slice = &s[0..len];
let slice = &s[..];

Be careful when using slice syntax on strings. Slice indices must fall on character boundaries, that is, on the boundaries of UTF-8 characters. For example, each Chinese character occupies three bytes in UTF-8, so the following code will crash:

let s = "中国人";
let a = &s[0..2];
println!("{}", a);

We take only the first two bytes of s, but a Chinese character occupies three bytes, so the index does not fall on a boundary: we do not even get a complete 中. The program panics and exits immediately. If the slice is changed to &s[0..3], the program runs normally.

Therefore, be very careful whenever you slice or index strings. For how to work with UTF-8 strings, see the section on operating on UTF-8 strings below.
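One way to avoid the panic is to use the standard library's non-panicking str::get together with str::is_char_boundary. The sketch below is a minimal illustration, and the helper name first_bytes is hypothetical.

```rust
// Hypothetical helper: try to take the first `n` bytes of `s` as a string slice.
// Returns None instead of panicking when `n` is not on a char boundary or is out of range.
fn first_bytes(s: &str, n: usize) -> Option<&str> {
    s.get(0..n)
}

fn main() {
    let s = "中国人";
    // Byte index 2 is inside the first character, byte index 3 is a boundary.
    println!("{}", s.is_char_boundary(2)); // false
    println!("{}", s.is_char_boundary(3)); // true
    println!("{:?}", first_bytes(s, 2)); // None
    println!("{:?}", first_bytes(s, 3)); // Some("中")
}
```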

The type annotation for a string slice is &str, so we can declare a function that takes a String reference and returns a slice of it: fn first_word(s: &String) -> &str.

With slices, we can write safe code like this:

fn main() {
    let mut s = String::from("hello world");

    let word = first_word(&s);

    s.clear(); // error!

    println!("the first word is: {}", word);
}

The compiler reported the following error:

error[E0502]: cannot borrow `s` as mutable because it is also borrowed as immutable
  --> src/main.rs:18:5
   |
16 |     let word = first_word(&s);
   |                           -- immutable borrow occurs here
17 | 
18 |     s.clear(); // error!
   |     ^^^^^^^^^ mutable borrow occurs here
19 | 
20 |     println!("the first word is: {}", word);
   |                                       ---- immutable borrow later used here

Recall the borrowing rule: while an immutable borrow is still in use, we cannot also take a mutable borrow. clear needs to empty the String and therefore takes a mutable borrow, but the immutable borrow held by word is used again afterwards in println!, so the code does not compile.

As the code above shows, Rust not only makes our API easier to use, it also eliminates a whole class of errors at compile time!
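For completeness, here is one possible sketch of first_word together with a version of main that compiles. The body of first_word is an assumption, since the text only gives its signature, and it takes &str rather than &String so that it also accepts literals. The only change to main is that word is used for the last time before s.clear() is called.

```rust
// Assumed implementation: return everything before the first space,
// or the whole string if there is no space.
fn first_word(s: &str) -> &str {
    match s.find(' ') {
        // `find` returns a byte index on a char boundary, so slicing here is safe.
        Some(i) => &s[..i],
        None => s,
    }
}

fn main() {
    let mut s = String::from("hello world");

    let word = first_word(&s);
    // Last use of the immutable borrow held by `word`.
    println!("the first word is: {}", word);

    // The immutable borrow has ended, so a mutable borrow is now allowed.
    s.clear();
    println!("s is empty: {}", s.is_empty());
}
```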

Other slices

Because a slice is a reference to part of a collection, slices are not limited to strings. They work for other collection types too, such as arrays:

let a = [1, 2, 3, 4, 5];

let slice = &a[1..3];

assert_eq!(slice, &[2, 3]);

The type of this array slice is &[i32]. Array slices work the same way as string slices: they hold a reference to the first element and a length. Collection types are described in detail in their own chapter.
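A practical consequence is that a function taking &[i32] works with any contiguous run of i32 values, whatever collection it comes from. The sketch below uses a hypothetical sum function and, as an extra assumption beyond the text, a Vec next to the array.

```rust
// Hypothetical helper: add up all elements of an i32 slice.
fn sum(values: &[i32]) -> i32 {
    values.iter().sum()
}

fn main() {
    let a = [1, 2, 3, 4, 5];
    let v = vec![10, 20, 30];

    println!("{}", sum(&a[1..3])); // part of an array: 2 + 3
    println!("{}", sum(&a)); // a reference to the whole array coerces to a slice
    println!("{}", sum(&v[..2])); // part of a Vec: 10 + 20
}
```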

String literals are slices

String literals were mentioned earlier, but their type was not:

let s = "Hello, world!";

The type of s is in fact &str, so you can also declare it like this:

let s: &str = "Hello, world!";

The slice points to a location inside the program's executable. This is also why string literals are immutable: &str is an immutable reference.

Now that we understand slices, we can move on to the main topic of this section.

What is a string?

As the name suggests, a string is a contiguous sequence of characters. As mentioned in the previous section, a character (char) in Rust is a Unicode value and always occupies 4 bytes of memory. Strings are different: they are UTF-8 encoded, so each character occupies a variable number of bytes (1 to 4), which greatly reduces the memory a string takes up.
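The difference between the fixed size of a char and the variable width of a character inside a string can be observed directly. This is a small sketch, and the helper name byte_sizes is hypothetical.

```rust
use std::mem::size_of;

// Hypothetical helper: returns (size of the char type in memory,
// number of bytes this character needs when encoded as UTF-8).
fn byte_sizes(c: char) -> (usize, usize) {
    (size_of::<char>(), c.len_utf8())
}

fn main() {
    println!("{:?}", byte_sizes('a')); // (4, 1)
    println!("{:?}", byte_sizes('中')); // (4, 3)
    // Inside a string only the UTF-8 bytes are stored: 1 + 3 = 4 bytes here.
    println!("{}", "a中".len());
}
```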

At the language level, Rust has only one string type: str, which usually appears in its borrowed form &str, the string slice described above. Although str is the only string type in the language itself, the standard library provides several string types for different purposes, the most widely used being String.

A str string literal is hard-coded into the executable and cannot be modified, whereas String is a growable, mutable, owned UTF-8 string. When Rust users talk about strings, they usually mean the String type and the &str string slice type, both of which are UTF-8 encoded.

Besides String, Rust's standard library also provides other string types, such as OsString, OsStr, CString and CStr. Notice that these names end in String or Str? They correspond to owned and borrowed variants respectively.
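The owned and borrowed pairing can be seen in a short sketch using the std::ffi types, where the C string types are spelled CString and CStr. The file name and the helper to_c_bytes are hypothetical and used only for illustration.

```rust
use std::ffi::{CStr, CString, OsStr, OsString};

// Hypothetical helper: the bytes a C API would see, including the trailing NUL.
// Panics if `text` contains an interior NUL byte.
fn to_c_bytes(text: &str) -> Vec<u8> {
    let owned: CString = CString::new(text).expect("text must not contain NUL bytes");
    let borrowed: &CStr = owned.as_c_str();
    borrowed.to_bytes_with_nul().to_vec()
}

fn main() {
    // OsString owns its data, &OsStr borrows it (like String and &str).
    let owned_os: OsString = OsString::from("notes.txt");
    let borrowed_os: &OsStr = owned_os.as_os_str();
    println!("{:?}", borrowed_os);

    // CString owns a NUL-terminated buffer, &CStr borrows it.
    println!("{:?}", to_c_bytes("hi")); // [104, 105, 0]
}
```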

Operating on strings

Since String is a mutable string, we can create it, append to it, and remove from it. The following code summarizes the relevant operations:

fn main() {
    // Create an empty String
    let mut s = String::new();
    // Append the &str "hello,world" to s
    s.push_str("hello,world");
    // Push the character '!' onto s
    s.push('!');
    // s is now "hello,world!"
    assert_eq!(s, "hello,world!");

    // Create a String from an existing &str slice
    let mut s = "hello,world".to_string();
    // Push the character '!' onto s
    s.push('!');
    // s is now "hello,world!"
    assert_eq!(s, "hello,world!");

    // Create a String from an existing &str slice
    // Both String and &str are UTF-8 encoded, so Chinese is supported
    let mut s = String::from("你好,world");
    // Push the character '!' onto s
    s.push('!');
    // s is now "你好,world!"
    assert_eq!(s, "你好,world!");

    let s1 = String::from("hello,");
    let s2 = String::from("world!");
    // s1 is moved into the + operation below, so it can no longer be used afterwards
    let s3 = s1 + &s2;
    assert_eq!(s3, "hello,world!");
    // Uncommenting the following line causes a compile error
    // println!("{}", s1);
}

One thing in the code above needs explaining: the use of + to concatenate strings. The expression is written s1 + &s2 because + calls the add method, whose definition looks roughly like this:

fn add(self, s: &str) -> String {

Because this method involves the more complex topic of traits, we explain it only briefly here. self is s1, of type String. The signature says that you can only add a &str string slice to a String, and that the result is a new String. That makes let s3 = s1 + &s2; easy to explain: adding s1 (a String) and &s2 (used as a &str) yields s3 (a String).

It can be inferred that the following codes are also legal:

  let s1 = String::from("tic");
  let s2 = String::from("tac");
  let s3 = String::from("toe");

  // String = String + &str + &str + &str + &str
  let s = s1 + "-" + &s2 + "-" + &s3;

String + &str returns a String, which is then added to another &str to produce a String again. This repeats until we end up with s, which is also a String.
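When several pieces are joined, the format! macro is a common alternative to chained + operations. It only borrows its arguments, so none of the inputs is moved. The sketch below uses a hypothetical helper named join_with_dashes.

```rust
// Hypothetical helper: join three string slices with dashes using format!.
fn join_with_dashes(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = join_with_dashes(&s1, &s2, &s3);
    println!("{}", s); // tic-tac-toe

    // Unlike `s1 + ...`, nothing was moved, so all three inputs are still usable.
    println!("{} {} {}", s1, s2, s3);
}
```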

The code above contains an operation that may be hard to understand at first: passing &s2, a reference to a String, where a &str is expected. Let's look at it more closely.

Conversion between String and &str

In the previous code, we have already seen two ways to create a String from a &str:

  • String::from("hello,world")
  • "hello,world".to_string()

So how do you convert a String to a &str? The answer is simple: just take a reference:

fn main() {
    let s = String::from("hello,world!");
    say_hello(&s);
    say_hello(&s[..]);
    say_hello(s.as_str());
}

fn say_hello(s: &str) {
    println!("{}",s);
}

In fact, this flexibility comes from deref coercion, which is explained in detail in the chapter on the Deref trait.

String indexing

In other languages it is normal to access a character or substring of a string by index, but in Rust this reports an error:

   let s1 = String::from("hello");
   let h = s1[0];

This code produces the following error:

3 |     let h = s1[0];
  |             ^^^^^ `String` cannot be indexed by `{integer}`
  |
  = help: the trait `Index<{integer}>` is not implemented for `String`

https://rustwiki.org/en/book/ch08-02-strings.html#storing-utf-8-encoded-text-with-strings

Inside strings

The underlying storage of a string is actually [u8], a byte array. For let hello = String::from("Hola");, the length of hello is 4 bytes, because each letter in "Hola" occupies only 1 byte in UTF-8. But what about the following code?

let hello = String::from("中国人");

If you are asked how long this string is, you might say 3, but it is actually 9 bytes long, because each Chinese character takes 3 bytes in UTF-8. Indexing into hello in this case, for example with &hello[0], makes no sense: you would not get the character 中, only the first of its three bytes, which is a strange and hard-to-interpret value.
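If the first byte or the first character is really what you want, you have to say so explicitly. The following sketch shows both views with hypothetical helper names.

```rust
// Hypothetical helper: the first raw byte of the string, if any.
fn first_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied()
}

// Hypothetical helper: the first Unicode character of the string, if any.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn main() {
    let hello = String::from("中国人");
    println!("{}", hello.len()); // 9 bytes
    println!("{:?}", first_byte(&hello)); // Some(228), one third of a character
    println!("{:?}", first_char(&hello)); // Some('中')
}
```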

Different representations of strings

Now look at the string "नमस्ते", written in Sanskrit. Its underlying byte array is as follows:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164,
224, 165, 135]

That is 18 bytes, and it is how the computer ultimately stores the string. Viewed as Unicode characters (Rust's char type), it is:

['न', 'म', 'स', '्', 'त', 'े']

Here, however, the fourth and sixth characters are not letters at all: they are diacritics that have no meaning on their own. Finally, viewed as letters (grapheme clusters):

["न", "म", "स्", "ते"]

As you can see, Rust offers different ways of interpreting the raw string data, so that each program can choose the view it needs, regardless of what human language the data is in.

There is another reason Rust does not allow indexing into a String: indexing operations are expected to take O(1) time, and that cannot be guaranteed for a String, because Rust would have to walk the contents from the beginning to find the requested character.
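The byte view and the character view can be reproduced with the standard library alone. In the sketch below the word is spelled with Unicode escapes so that its six code points are unambiguous, and the helper name views is hypothetical. The grapheme-cluster view is not part of the standard library and is left out here.

```rust
// Hypothetical helper: returns (number of bytes, number of chars) for a string.
fn views(s: &str) -> (usize, usize) {
    (s.bytes().count(), s.chars().count())
}

fn main() {
    // The six code points of the word, written as escapes.
    let word = "\u{928}\u{92e}\u{938}\u{94d}\u{924}\u{947}";

    let bytes: Vec<u8> = word.bytes().collect();
    let chars: Vec<char> = word.chars().collect();

    println!("{:?}", bytes); // the 18 raw UTF-8 bytes
    println!("{:?}", chars); // the 6 Unicode characters
    println!("{:?}", views(word)); // (18, 6)
}
```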

String slicing

As mentioned earlier, slicing a string is a dangerous operation, because slice indices count bytes while the string is UTF-8 encoded, so there is no guarantee that an index falls exactly on a character boundary. For example:

let hello = "中国人";

let s = &hello[0..2];

Running this program causes an immediate crash:

thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '中' (bytes 0..3) of `中国人`', src/main.rs:4:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The message is very clear: the byte index falls inside the character 中, so the result would be meaningless.

Therefore, be extra careful when accessing a string through an index range. One slip and your program will crash!

Operating on UTF-8 strings

Several ways of working with UTF-8 strings were mentioned above. They are explained one by one below.

Characters

If you want to iterate over a string one Unicode character at a time, the best way is to use the chars method, for example:

for c in "中国人".chars() {
    println!("{}", c);
}

The output is as follows:

中
国
人

Bytes

The bytes method returns the underlying byte representation of the string:

for b in "中国人".bytes() {
    println!("{}", b);
}

The output is as follows:

228
184
173
229
155
189
228
186
186

Getting substrings

Accurately extracting a substring from a UTF-8 string is more complicated. For example, the standard library offers no direct method for taking a character-indexed substring out of a variable-width string such as holla中国人नमस्ते. You need to search for utf8 on crates.io to find the functionality you want.

Consider trying this library: utf8_slice.
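If you would rather not add a dependency, a character-indexed substring can be approximated by hand with char_indices. The sketch below is only an illustration: the helper substr_by_chars is hypothetical, it counts Unicode characters (char) and not grapheme clusters, and it walks the string from the start on every call.

```rust
// Hypothetical helper: substring by character positions, half-open range [start, end).
// Returns None if the range is reversed or reaches past the end of the string.
fn substr_by_chars(s: &str, start: usize, end: usize) -> Option<&str> {
    if start > end {
        return None;
    }
    // Byte offset of every char boundary, plus the end of the string.
    let boundaries: Vec<usize> = s
        .char_indices()
        .map(|(byte_index, _)| byte_index)
        .chain(std::iter::once(s.len()))
        .collect();
    let begin = *boundaries.get(start)?;
    let finish = *boundaries.get(end)?;
    // Both offsets are char boundaries, so this never splits a character.
    s.get(begin..finish)
}

fn main() {
    let s = "holla中国人";
    println!("{:?}", substr_by_chars(s, 5, 7)); // Some("中国")
    println!("{:?}", substr_by_chars(s, 0, 9)); // None: there are only 8 characters
}
```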

String underlying parsing

@todo
