Algorithm - discretization

Posted by kpowning on Sat, 25 Dec 2021 22:08:30 +0100

In this post, discretization refers specifically to order-preserving discretization of integers.

1 What is discretization

A common situation: we need to process a small number of values (e.g. 10^5) spread over a very large range (e.g. 0 to 10^9), that is, "large value range, few values". Indexing directly by value would waste space and time in proportion to the size of the range, and discretization solves this problem.
For example, given four subscripts 1, 50, 10000 and 10^9, we can map them to consecutive natural numbers starting from 0: 0, 1, 2 and 3. Using these numbers as the new array subscripts greatly reduces the space and time required.
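As a rough sketch of that mapping, here is one way to number the distinct values in ascending order. It uses std::map rather than the sort + binary search approach described in section 2, and the name build_index is made up for this illustration:

```cpp
#include <map>
#include <vector>

// Hypothetical helper (not from the original post):
// maps each distinct value to its 0-based rank in ascending order.
std::map<int, int> build_index(const std::vector<int>& values) {
    std::map<int, int> index;           // original value -> new subscript
    for (int v : values) index[v] = 0;  // collect distinct keys; the map keeps them sorted
    int next = 0;
    for (auto& kv : index) kv.second = next++;  // number the keys in ascending order
    return index;
}
```

Calling build_index({1, 50, 10000, 1000000000}) should give the subscripts 0, 1, 2 and 3 for the four values, in that order.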

2 Method

2.1 Two helper functions

2.1.1 unique()

① Form: iterator unique(iterator it_1, iterator it_2);
② Result: consecutive duplicate elements in the range [it_1, it_2) are removed (the interval is half-open, so the element pointed to by it_2 is not included). The return value is an iterator to one past the last element of the deduplicated sequence.
③ Essence: unique() does not shrink the container. It keeps moving the later non-duplicate elements forward so that they overwrite the duplicates; whatever remains behind the returned iterator is leftover data.
④ Example:

vector<int> a = {1,3,3,4,5,6,6,7};
unique(a.begin(), a.end());
for(int i = 0; i < a.size(); i ++){
	cout << a[i];
}
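// Typical output: 13456767
// The first six elements (134567) are the deduplicated sequence; the last two are
// leftovers whose values are unspecified by the standard.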

⑤ Note: unique() only removes adjacent duplicates, so you usually sort the range before calling it.
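To see why sorting matters, the sketch below counts how many elements unique() keeps with and without a preceding sort. The helper name kept_by_unique is invented for this example:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: returns how many elements std::unique keeps.
// v is taken by value so the caller's vector is left untouched.
int kept_by_unique(std::vector<int> v, bool sort_first) {
    if (sort_first) std::sort(v.begin(), v.end());
    auto new_end = std::unique(v.begin(), v.end());  // one past the last kept element
    return static_cast<int>(new_end - v.begin());
}
```

For {3, 1, 3, 2, 1} no two equal elements are adjacent, so without sorting all 5 elements are kept. After sorting, the range becomes {1, 1, 2, 3, 3} and only 3 elements remain.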

2.1.2 erase()

erase(it_1, it_2) deletes the elements in the range [it_1, it_2) between two iterators, which saves removing them by hand.
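A small sketch of the range form of erase() on a vector. The wrapper erase_range and its index parameters are only for illustration:

```cpp
#include <vector>

// Hypothetical helper: removes the elements at positions [first, last)
// and returns the shortened vector.
std::vector<int> erase_range(std::vector<int> v, int first, int last) {
    v.erase(v.begin() + first, v.begin() + last);  // half-open range, last is not removed
    return v;
}
```

For example, erase_range({10, 20, 30, 40, 50}, 1, 3) removes 20 and 30 and leaves {10, 40, 50}.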

2.2 Two operations

2.2.1 Deduplication

When processing data, the same subscript may be operated on more than once, so the list of subscripts to be discretized will contain duplicates. It should be deduplicated first.

vector<int> alls; // Store all values to be discretized
sort(alls.begin(), alls.end()); // Sort all values
alls.erase(unique(alls.begin(), alls.end()), alls.end());   // Remove duplicate elements

2.2.2 Computing the discretized value

We map each original subscript to a new, compact subscript, and the value stored at the original subscript is then stored at the new one.

// Find the discretized value corresponding to x
int find(int x) // Find the first position greater than or equal to x
{
    int l = 0, r = alls.size() - 1;
    while (l < r)
    {
        int mid = l + r >> 1;
        if (alls[mid] >= x) r = mid;
        else l = mid + 1;
    }
    return r + 1; // Map to 1, 2, ..., n
}

Mapping to 1, 2, ..., n is convenient for later processing, such as prefix sums.
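For any x that is actually stored in alls, the hand-written binary search should give the same result as std::lower_bound. A minimal sketch, with alls passed as a parameter (the post uses a global) and the name find_lb invented here:

```cpp
#include <algorithm>
#include <vector>

// Sketch: same result as find() for any x contained in alls.
// Assumes alls is already sorted and deduplicated.
int find_lb(const std::vector<int>& alls, int x) {
    // First position whose element is >= x
    auto it = std::lower_bound(alls.begin(), alls.end(), x);
    return static_cast<int>(it - alls.begin()) + 1;  // shift to a 1-based subscript
}
```

With alls = {-5, 1, 50, 10000}, find_lb(alls, 50) returns 3. The two versions can differ when x is larger than every element of alls, which does not happen here because every looked-up value was inserted into alls beforehand.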

3 Example

Interval sum

Link: 802. Interval sum
Analysis:

  • The problem first modifies single positions many times, so no difference array is needed.
  • It then queries the sum over multiple intervals [l, r], so prefix sums are required.
  • Value range: -10^9 <= x <= 10^9
    Count: n operations and m queries with 1 <= n, m <= 10^5, and each query involves two numbers. Therefore at most 3 × 10^5 subscripts need to be handled.
    Therefore, it is necessary to discretize the data first.
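The arithmetic behind this can be spelled out as a quick sketch. The constant and function names below are invented for the illustration:

```cpp
#include <cstdint>

// Bytes needed for an int array with the given number of slots.
constexpr std::int64_t bytes_for(std::int64_t slots) {
    return slots * static_cast<std::int64_t>(sizeof(int));
}

// One slot for every x in [-10^9, 10^9]
constexpr std::int64_t direct_slots = 2LL * 1000000000 + 1;
// n insert positions plus 2 * m query endpoints, with n, m <= 10^5
constexpr std::int64_t compressed_slots = 100000 + 2LL * 100000;
```

Assuming a 4-byte int, bytes_for(direct_slots) is roughly 8 GB, while bytes_for(compressed_slots) is about 1.2 MB.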

Overall idea:
When reading the data, store each operation as a pair<int, int> and apply a[find(first)] += second to discretize it.
When answering queries, first build the prefix sums S[] from a[], then compute S[r] - S[l - 1], where l and r are the discretized endpoints.
So the key to the problem is how to do a[find(first)] += second. This is the discretization step.
Code:

#include<iostream>
#include<cstdio>   // scanf
#include<vector>
#include<algorithm>

using namespace std;

typedef pair<int, int> PII;

const int N = 300010;

int a[N], S[N];
vector<int> alls;  //Store all values to be discretized
vector<PII> add, query;

int find(int x){
    // Binary search for the position of x in alls (alls is already sorted and deduplicated)
    int l = 0;
    int r = alls.size() - 1;
    while(l < r){
        int mid = l + r >> 1;
        if(alls[mid] >= x) r = mid;  // Because r = mid, mid is rounded down (no + 1 on l + r)
        else l = mid + 1;
    }
    return r + 1; // + 1 shifts the subscript from 0-based to 1-based
} 

int main(){
    int n, m;
    cin >> n >> m;
    while(n --){
        int x, c;
        scanf("%d%d", &x, &c);
        add.push_back({x, c});
        
        alls.push_back(x);
    }
    
    while(m --){
        int l, r;
        scanf("%d%d", &l, &r);
        query.push_back({l, r});
        
        alls.push_back(l);
        alls.push_back(r);
    }
    
    //duplicate removal
    sort(alls.begin(), alls.end());
    alls.erase(unique(alls.begin(), alls.end()), alls.end());
    
    //Process insertion: obtain a new subscript and save the corresponding value in a[N]
    for(auto item : add){
        int x = find(item.first);
        a[x] += item.second; 
    }
    
    //Precompute prefix sums
    for(int i = 1; i <= alls.size(); i ++) S[i] = S[i - 1] + a[i];
    
    //Processing queries
    for(auto item : query){
        int l = find(item.first);
        int r = find(item.second);
        cout << S[r] - S[l - 1] << endl;
    }
    return 0;
}
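For checking the logic on a small input by hand, the same pipeline can be sketched as a function instead of a program that reads stdin. The name interval_sums and its signature are illustrative, and it uses std::lower_bound in place of the hand-written find():

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using PII = std::pair<int, int>;

// Sketch only: adds holds (x, c) pairs, queries holds (l, r) pairs.
std::vector<long long> interval_sums(const std::vector<PII>& adds,
                                     const std::vector<PII>& queries) {
    // 1. Collect every coordinate that will be touched.
    std::vector<int> alls;
    for (const auto& p : adds) alls.push_back(p.first);
    for (const auto& q : queries) {
        alls.push_back(q.first);
        alls.push_back(q.second);
    }
    // 2. Sort and deduplicate.
    std::sort(alls.begin(), alls.end());
    alls.erase(std::unique(alls.begin(), alls.end()), alls.end());

    // 1-based discretized subscript of x (x is always present in alls here).
    auto find = [&](int x) {
        auto it = std::lower_bound(alls.begin(), alls.end(), x);
        return static_cast<int>(it - alls.begin()) + 1;
    };

    // 3. Apply the additions, then turn s into prefix sums in place.
    std::vector<long long> s(alls.size() + 1, 0);
    for (const auto& p : adds) s[find(p.first)] += p.second;
    for (std::size_t i = 1; i < s.size(); ++i) s[i] += s[i - 1];

    // 4. Answer each query with S[r] - S[l - 1].
    std::vector<long long> answers;
    for (const auto& q : queries)
        answers.push_back(s[find(q.second)] - s[find(q.first) - 1]);
    return answers;
}
```

With the additions (1, 2), (3, 6), (7, 5) and the queries [1, 3], [4, 6], [7, 8], the sorted coordinates are 1, 3, 4, 6, 7, 8 and the answers work out to 8, 0 and 5.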

Shortly after submitting, I got a wrong answer. The reason was that in my find() I had written alls[mid] > x instead of alls[mid] >= x. This binary search template has to be exactly right, with no room for a slip like that. It left me speechless.
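The effect of that one-character slip can be reproduced with a sketch in which the comparison is switchable. The function find_with and its strict flag are made up for this demonstration:

```cpp
#include <vector>

// Sketch: the post's binary search with a switchable comparison.
// strict == false uses alls[mid] >= x (correct);
// strict == true  uses alls[mid] >  x (the mistake described above).
int find_with(const std::vector<int>& alls, int x, bool strict) {
    int l = 0, r = static_cast<int>(alls.size()) - 1;
    while (l < r) {
        int mid = (l + r) >> 1;
        bool go_left = strict ? (alls[mid] > x) : (alls[mid] >= x);
        if (go_left) r = mid;
        else l = mid + 1;
    }
    return r + 1;  // 1-based subscript
}
```

With alls = {1, 3, 7}, the correct version maps 3 to subscript 2, while the strict version skips past it and returns 3, the subscript that belongs to 7. Values added at 3 would then be credited to the wrong position.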

Topics: C++ Algorithm data structure