Exploration of slice's In function in Go

Posted by ntohky14 on Sun, 15 Sep 2019 06:00:05 +0200

I came across this question some time ago: why doesn't Go have something like Python's in? After searching for it, I found that many people still ask the same question.

Let's talk about this topic today. If you find this article helpful, thank you!

in is a very common feature, and some languages call it contains. Although the form differs from language to language, almost all of them have it. Unfortunately, Go does not: it provides neither an operator like Python's in nor a standard library function like PHP's in_array.

Go's philosophy is that less is more. I suspect the Go team considers this feature too trivial to be worth providing.

Why is it trivial? And what if you want to implement it yourself?

I can think of three implementations: traversal, binary search after sorting, and map key lookup.

The source code for this article has been uploaded to my GitHub: poloxue/gotin.

Traversal

Traversal is the simplest approach and the first one that comes to mind.

Examples are as follows:

func InIntSlice(haystack []int, needle int) bool {
    for _, e := range haystack {
        if e == needle {
            return true
        }
    }

    return false
}

The example above shows how to check whether a given int exists in a []int variable. It is very simple, which is why I called it trivial to implement.

The limitation of this example is that it supports only a single type. To support a generic in like that of interpreted languages, reflection is required.

The code is as follows:

func In(haystack interface{}, needle interface{}) (bool, error) {
    sVal := reflect.ValueOf(haystack)
    kind := sVal.Kind()
    if kind == reflect.Slice || kind == reflect.Array {
        for i := 0; i < sVal.Len(); i++ {
            if sVal.Index(i).Interface() == needle {
                return true, nil
            }
        }

        return false, nil
    }

    return false, ErrUnSupportHaystack
}

To make it more general, both input parameters of the In function, haystack and needle, are of type interface{}.

Let's briefly go over the benefits of making both parameters interface{}. There are two main points:

First, because haystack is interface{}, in can support arrays as well as slices. Inside the function, the type of haystack is checked by reflection, and slices and arrays are accepted. Any other type returns an error. Adding support for new types, such as map, would be easy, but it is not recommended, because for a map the _, ok := m[k] syntax already does what in does.

Second, because haystack is interface{}, a []interface{} is also acceptable, and needle is interface{} too. In this way we can achieve the same effect as in an interpreted language.

How should this be understood? An example illustrates it directly:

gotin.In([]interface{}{1, "two", 3}, "two")

The haystack is []interface{}{1, "two", 3} and the needle is an interface{} whose value is "two". As in an interpreted language, the elements can be of any type and do not all have to be the same. This lets us use it freely.

It is important to note, however, that there is a piece of code in the implementation of the In function:

if sVal.Index(i).Interface() == needle {
    ...
}

Not all Go types can be compared with ==. If an element is, or contains, a slice or a map, and needle has the same type, the comparison panics at runtime.
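One way to sidestep this is to compare with reflect.DeepEqual instead of ==. The following is only a sketch under assumptions: InDeepEqual is a hypothetical name, not a function from gotin, and the message text of ErrUnSupportHaystack is invented here because the article shows only the error's name. Note that DeepEqual's semantics are not identical to ==; for example, it compares what pointers point to, so results can differ for some element types.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// ErrUnSupportHaystack reuses the error name from the article;
// the message text is an assumption.
var ErrUnSupportHaystack = errors.New("haystack must be a slice or an array")

// InDeepEqual is a hypothetical variant of In that compares elements with
// reflect.DeepEqual, so uncomparable element types do not cause a panic.
func InDeepEqual(haystack interface{}, needle interface{}) (bool, error) {
	sVal := reflect.ValueOf(haystack)
	kind := sVal.Kind()
	if kind != reflect.Slice && kind != reflect.Array {
		return false, ErrUnSupportHaystack
	}

	for i := 0; i < sVal.Len(); i++ {
		if reflect.DeepEqual(sVal.Index(i).Interface(), needle) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	haystack := []interface{}{1, "two", []int{3}}

	// With ==, comparing the []int element against a []int needle would panic.
	found, err := InDeepEqual(haystack, []int{3})
	fmt.Println(found, err) // true <nil>

	found, err = InDeepEqual(haystack, "three")
	fmt.Println(found, err) // false <nil>

	_, err = InDeepEqual(42, 1)
	fmt.Println(err == ErrUnSupportHaystack) // true
}
```

The trade-off is that DeepEqual does more work per element than ==, so this is likely slower still than the reflection version above.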

Binary Search

The disadvantage of confirming existence by traversal is that if the array or slice contains a large amount of data, say 1,000,000 elements (one million), then in the worst case we have to traverse 1,000,000 times to get an answer. The time complexity is O(n).

How can we reduce the number of elements visited?

The natural idea is binary search, whose time complexity is O(log n). However, this algorithm has a precondition: it relies on an ordered sequence.

Thus the first problem to solve is making the sequence ordered. Go's standard library already provides this in the sort package.

The sample code is as follows:

s := []int{4, 2, 5, 1, 6}
sort.Ints(s) // sorts in place; s is now [1 2 4 5 6]

For []int, the function we use is sort.Ints. For slices of other types, sort provides corresponding functions; for example, []string can be sorted with sort.Strings.

After sorting, we can do a binary search. Fortunately, Go provides this as well; the function for []int is sort.SearchInts.

To briefly introduce this function, first look at the definition:

func SearchInts(a []int, x int) int

The input parameters are easy to understand: search for x in slice a. The important part is the return value, which matters later when confirming whether an element exists. It is the position of the element in the slice; if the element does not exist, it is the position where the element would have to be inserted to keep the slice ordered.

For example, the sequence is as follows:

1 2 6 8 9 11

If x is 6, the lookup finds it at index 2. If x is 7, the element does not exist; inserted in order, it would go between 6 and 8, at index 3, so the return value is 3.

Test it with code:

fmt.Println(sort.SearchInts([]int{1, 2, 6, 8, 9, 11}, 6)) // 2
fmt.Println(sort.SearchInts([]int{1, 2, 6, 8, 9, 11}, 7)) // 3

To determine whether an element is in the sequence, simply check whether the value at the returned position equals the value being searched for.

There is one more case: if the insertion point is past the end of the sequence, for example when the value is 12, the returned position is 6, which equals the sequence length. Reading the element at index 6 directly would go out of bounds. What should we do? It is enough to check whether the returned index is less than the slice length; if it is not, the element is not in the slice.

The complete implementation code is as follows:

func SortInIntSlice(haystack []int, needle int) bool {
    sort.Ints(haystack)

    index := sort.SearchInts(haystack, needle)
    return index < len(haystack) && haystack[index] == needle
}

However, there is a problem: for unsorted data, sorting on every query is not cost-effective. It is better to sort only once, which takes a small change to the code.

func InIntSliceSortedFunc(haystack []int) func(int) bool {
    sort.Ints(haystack)

    return func(needle int) bool {
        index := sort.SearchInts(haystack, needle)
        return index < len(haystack) && haystack[index] == needle
    }
}

In the implementation above, InIntSliceSortedFunc sorts the haystack slice once and returns a function that can be called many times.

Use cases are as follows:

in := gotin.InIntSliceSortedFunc(haystack)

for i := 0; i < maxNeedle; i++ {
    if in(i) {
        fmt.Printf("%d is in %v\n", i, haystack)
    }
}
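One thing worth keeping in mind: sort.Ints sorts in place, so both SortInIntSlice and InIntSliceSortedFunc reorder the slice the caller passed in. If the original order matters, a possible variation is to sort a private copy instead. The sketch below assumes a hypothetical name, InIntSliceSortedCopyFunc, which is not part of gotin.

```go
package main

import (
	"fmt"
	"sort"
)

// InIntSliceSortedCopyFunc is a hypothetical variant of InIntSliceSortedFunc
// that sorts a private copy, leaving the caller's slice in its original order.
func InIntSliceSortedCopyFunc(haystack []int) func(int) bool {
	sorted := make([]int, len(haystack))
	copy(sorted, haystack)
	sort.Ints(sorted)

	return func(needle int) bool {
		index := sort.SearchInts(sorted, needle)
		return index < len(sorted) && sorted[index] == needle
	}
}

func main() {
	haystack := []int{4, 2, 5, 1, 6}
	in := InIntSliceSortedCopyFunc(haystack)

	fmt.Println(in(5), in(3)) // true false
	fmt.Println(haystack)     // [4 2 5 1 6]
}
```

The cost is one extra allocation of the same size as the haystack, paid once when the lookup function is created.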

What are the shortcomings of binary search?

The main one, I think, is that the elements must be sortable, as int, string, and float values are. Structs, slices, arrays, maps, and the like are less convenient. They can still be used, but they need some extension, such as sorting by a specified criterion like a struct field.
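As a rough illustration of such an extension, the sketch below sorts a slice of structs by one field with sort.Slice and then looks up by that field with sort.Search. The User type and the InUsersByIDFunc name are hypothetical, made up for this example only.

```go
package main

import (
	"fmt"
	"sort"
)

// User is a hypothetical struct used only for this illustration.
type User struct {
	ID   int
	Name string
}

// InUsersByIDFunc sorts a copy of users by ID once and returns a lookup
// function that binary-searches on that field.
func InUsersByIDFunc(users []User) func(id int) bool {
	sorted := make([]User, len(users))
	copy(sorted, users)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })

	return func(id int) bool {
		// sort.Search returns the smallest index whose ID is >= id,
		// or len(sorted) if there is no such element.
		index := sort.Search(len(sorted), func(i int) bool { return sorted[i].ID >= id })
		return index < len(sorted) && sorted[index].ID == id
	}
}

func main() {
	users := []User{{ID: 8, Name: "h"}, {ID: 2, Name: "b"}, {ID: 6, Name: "f"}}
	in := InUsersByIDFunc(users)

	fmt.Println(in(6), in(7)) // true false
}
```

This only answers "is there a user with this ID"; matching on the whole struct would need a different ordering or a different approach.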

That concludes the binary search implementation of in.

map key

This section describes the map key approach. Its complexity is O(1): query performance stays the same regardless of the amount of data. It relies on Go's map data type, a hash map that checks for the existence of a key by hashing it directly to a storage location, an algorithm we should all be familiar with.

We often use this method.

_, ok := m[k]
if ok {
    fmt.Println("Found")
}

So how does this combine with in? An example illustrates it.

Suppose we have a variable of type []int, as follows:

s := []int{1, 2, 3, 4}

To check the existence of an element using the map's capabilities, you can convert s to map[int]struct{}.

m := map[int]struct{}{
    1: struct{}{},
    2: struct{}{},
    3: struct{}{},
    4: struct{}{},
}

To check for the existence of an element, simply write:

k := 4
if _, ok := m[k]; ok {
    fmt.Printf("%d is found\n", k)
}

Isn't that simple?

As an aside, for why struct{} is used here, see my earlier article, How to use set in Go.

Following this line of thought, the following functions are implemented:

func MapKeyInIntSlice(haystack []int, needle int) bool {
    set := make(map[int]struct{})

    for _, e := range haystack {
        set[e] = struct{}{}
    }

    _, ok := set[needle]
    return ok
}

It is not difficult to implement, but it has the same problem as binary search: the data must be preprocessed first, here by converting the slice to a map. If the data is the same on every call, the implementation can be modified slightly.

func InIntSliceMapKeyFunc(haystack []int) func(int) bool {
    set := make(map[int]struct{})

    for _, e := range haystack {
        set[e] = struct{}{}
    }

    return func(needle int) bool {
        _, ok := set[needle]
        return ok
    }
}

For the same data, it returns an in function that can be used many times. A use case is as follows:

in := gotin.InIntSliceMapKeyFunc(haystack)

for i := 0; i < maxNeedle; i++ {
    if in(i) {
        fmt.Printf("%d is in %v\n", i, haystack)
    }
}

Compared with the previous two algorithms, this approach has the fastest lookups and is well suited to large data sets. The performance tests below show the effect.

performance

Having described all the approaches, let's compare the performance of each algorithm in practice. The test source is in the file gotin_test.go.

The benchmarks mainly examine how the algorithms perform at different data sizes. This article uses three sample sizes: 10, 1000, and 1000000.

To make testing easier, we first define a function that generates the haystack and needle sample data.

The code is as follows:

func randomHaystackAndNeedle(size int) ([]int, int) {
    haystack := make([]int, size)

    for i := 0; i < size; i++ {
        haystack[i] = rand.Int()
    }

    return haystack, rand.Int()
}

The input parameter is size. The function uses rand.Int() to generate a random haystack of length size and a random needle. The benchmark cases call this function to generate their data.

An example is as follows:

func BenchmarkIn_10(b *testing.B) {
    haystack, needle := randomHaystackAndNeedle(10)

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, _ = gotin.In(haystack, needle)
    }
}

First, randomHaystackAndNeedle generates a random slice containing 10 elements. Because the time spent generating sample data should not be counted in the benchmark, we reset the timer with b.ResetTimer().

Second, the benchmark functions are named by the rule Benchmark + function name + sample data size. BenchmarkIn_10 in the example benchmarks the In function with 10 sample elements. If we benchmark InIntSlice with 1000 elements, the function is named BenchmarkInIntSlice_1000.
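For the functions that return a closure, where the one-time preprocessing happens matters a great deal. The sketch below is my guess at the shape of such a benchmark, not a copy of gotin_test.go: it builds the lookup function before b.ResetTimer(), so only the lookups are timed. It uses testing.Benchmark so it can run as an ordinary program; benchmarkSortedFunc is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"testing"
)

// InIntSliceSortedFunc is repeated from the article so the sketch is self-contained.
func InIntSliceSortedFunc(haystack []int) func(int) bool {
	sort.Ints(haystack)

	return func(needle int) bool {
		index := sort.SearchInts(haystack, needle)
		return index < len(haystack) && haystack[index] == needle
	}
}

// randomHaystackAndNeedle is repeated from the article.
func randomHaystackAndNeedle(size int) ([]int, int) {
	haystack := make([]int, size)
	for i := 0; i < size; i++ {
		haystack[i] = rand.Int()
	}
	return haystack, rand.Int()
}

// benchmarkSortedFunc is a hypothetical helper: data generation and the
// one-time sort happen before the timer is reset, so only lookups are measured.
func benchmarkSortedFunc(size int) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		haystack, needle := randomHaystackAndNeedle(size)
		in := InIntSliceSortedFunc(haystack)

		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			_ = in(needle)
		}
	})
}

func main() {
	result := benchmarkSortedFunc(1000)
	fmt.Println(result) // iterations and ns/op; values depend on the machine
}
```

If the sort were moved inside the loop, the measurement would instead resemble SortInIntSlice, which pays the sorting cost on every call.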

Start the test! A quick note on my laptop configuration: Mac Pro 15, 16 GB memory, 512 GB SSD, 4-core, 8-thread CPU.

Test the performance of all functions with a data size of 10.

$ go test -run=none -bench=10$ -benchmem

This matches all benchmark functions whose names end in 10.

Test results:

goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_10-8                         3000000               501 ns/op             112 B/op         11 allocs/op
BenchmarkInIntSlice_10-8                200000000                7.47 ns/op            0 B/op          0 allocs/op
BenchmarkInIntSliceSortedFunc_10-8      100000000               22.3 ns/op             0 B/op          0 allocs/op
BenchmarkSortInIntSlice_10-8            10000000               162 ns/op              32 B/op          1 allocs/op
BenchmarkInIntSliceMapKeyFunc_10-8      100000000               17.7 ns/op             0 B/op          0 allocs/op
BenchmarkMapKeyInIntSlice_10-8           3000000               513 ns/op             163 B/op          1 allocs/op
PASS
ok      github.com/poloxue/gotin        13.162s

The best performer is neither SortedFunc nor MapKeyFunc but the simplest single-type traversal, which averages 7.47 ns/op. The other two also do well, at 22.3 ns/op and 17.7 ns/op respectively.

The worst performers are In, SortInIntSlice, and MapKeyInIntSlice, which average 501 ns/op, 162 ns/op, and 513 ns/op respectively.

Test the performance of all functions with a data size of 1000.

$ go test -run=none -bench=1000$ -benchmem

Test results:

goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_1000-8                         30000             45074 ns/op            8032 B/op       1001 allocs/op
BenchmarkInIntSlice_1000-8               5000000               313 ns/op               0 B/op          0 allocs/op
BenchmarkInIntSliceSortedFunc_1000-8    30000000                44.0 ns/op             0 B/op          0 allocs/op
BenchmarkSortInIntSlice_1000-8             20000             65401 ns/op              32 B/op          1 allocs/op
BenchmarkInIntSliceMapKeyFunc_1000-8    100000000               17.6 ns/op             0 B/op          0 allocs/op
BenchmarkMapKeyInIntSlice_1000-8           20000             82761 ns/op           47798 B/op         65 allocs/op
PASS
ok      github.com/poloxue/gotin        11.312s

InIntSlice, InIntSliceSortedFunc, and InIntSliceMapKeyFunc remain the top three, but this time the order has changed. MapKeyFunc performs best at 17.6 ns/op, almost unchanged from the result at size 10. This again confirms the earlier statement that its query performance does not depend on the amount of data.

Similarly, with a data size of 1000000:

$ go test -run=none -bench=1000000$ -benchmem

The test results are as follows:

goos: darwin
goarch: amd64
pkg: github.com/poloxue/gotin
BenchmarkIn_1000000-8                                 30          46099678 ns/op         8000098 B/op    1000001 allocs/op
BenchmarkInIntSlice_1000000-8                       3000            424623 ns/op               0 B/op          0 allocs/op
BenchmarkInIntSliceSortedFunc_1000000-8         20000000                72.8 ns/op             0 B/op          0 allocs/op
BenchmarkSortInIntSlice_1000000-8                     10         138873420 ns/op              32 B/op          1 allocs/op
BenchmarkInIntSliceMapKeyFunc_1000000-8         100000000               16.5 ns/op             0 B/op          0 allocs/op
BenchmarkMapKeyInIntSlice_1000000-8                   10         156215889 ns/op        49824225 B/op      38313 allocs/op
PASS
ok      github.com/poloxue/gotin        15.178s

MapKeyFunc still performs best at 16.5 ns/op, followed by SortedFunc, while InIntSlice grows roughly linearly. In general, when there are no special performance requirements and the data volume is not very large, single-type traversal already performs very well.

The results also show that the generic In function implemented with reflection allocates a large amount of memory on every call. It is convenient, but that convenience comes at the expense of performance.
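For readers on newer toolchains: Go 1.18 and later, released after this article was written, support type parameters, which allow a generic traversal without reflection. The sketch below is a minimal example of that idea; InSlice is a hypothetical name and it is not part of gotin. I have not benchmarked it here, so no performance claim is made beyond the fact that it does not go through interface{} or the reflect package. Go 1.21 also added slices.Contains to the standard library, which serves the same purpose.

```go
package main

import "fmt"

// InSlice is a hypothetical generic traversal; it requires Go 1.18 or later.
// The comparable constraint restricts T to types that support ==.
func InSlice[T comparable](haystack []T, needle T) bool {
	for _, e := range haystack {
		if e == needle {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(InSlice([]int{1, 2, 3}, 2))       // true
	fmt.Println(InSlice([]string{"a", "b"}, "c")) // false
}
```

Unlike the reflection version, a haystack of mixed types is not accepted unless it is declared as []interface{} or []any, in which case the earlier caveat about uncomparable elements applies again.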

summary

This article started from the question of why Go has no in like Python's. On the one hand, I think it is simple enough to implement that providing it is unnecessary. On the other hand, different scenarios call for different implementations, so the right choice has to be worked out from the actual situation rather than fixed in advance.

We then described three ways to implement in and analyzed their advantages and disadvantages. The benchmarks give a general picture of which approach suits which scenario, but the analysis as a whole is not detailed enough; interested readers can investigate further.