R data wrangling (XI): using the purrr package for more elegant anonymous functions

Posted by paruby on Sun, 19 Dec 2021 17:17:30 +0100

I feel that the functions in the purrr package are very similar to the anonymous-function tools in Python.

Functionally, purrr mostly simplifies and enriches the way the apply family of functions is called.

1. map family

In fact, besides vectors, map can also be applied to data frames. It treats each column as a separate element, which is a bit like applying a function by column with apply. Note that apply first converts the data frame to a matrix, so it reports every column as character:

> map(infos, typeof)
$family
[1] "character"

$name
[1] "character"

$born
[1] "double"

> apply(infos, 2, typeof)
     family        name        born 
"character" "character" "character"
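The difference between the two results comes from apply() converting the data frame to a matrix before looping. A minimal sketch of this, using a small made-up data frame (the infos defined here is a hypothetical stand-in, not the author's data):

```r
library(purrr)

# Hypothetical stand-in for the infos data frame used above
infos <- data.frame(
  family = c("Zhang", "Lee"),
  name   = c("three", "four"),
  born   = c(1990, 1992),
  stringsAsFactors = FALSE
)

# map_chr() visits each column separately, so each column keeps its own type
col_types <- map_chr(infos, typeof)

# apply() calls as.matrix() first; a mixed data frame becomes a character matrix
mat_type <- typeof(as.matrix(infos))
```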

2. Anonymous functions in purrr

Data:

s <- c('10, 8, 7', 
      '5, 2, 2', 
      '3, 7, 8', 
      '8, 8, 9')

For example, if map needs a custom anonymous function, it can be passed in the same way as with apply and friends:

map_dbl(strsplit(s, split=",", fixed=TRUE),
  function(x) sum(as.numeric(x)))
## [1] 25  9 18 25

purrr also provides a shorthand:

map_dbl(strsplit(s, split=",", fixed=TRUE),
  ~ sum(as.numeric(.)))

The anonymous function is written in the format "~ expression", where the expression is the body of the function. Inside it, . (or .x) stands for the argument when there is only one argument; .x and .y stand for the arguments when there are two; and names such as ..1, ..2, ..3 stand for the arguments when there are more.
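The three naming conventions can be sketched side by side. The input vectors here are made up for illustration:

```r
library(purrr)

# One argument: refer to it as .x (or .)
squares <- map_dbl(1:3, ~ .x^2)

# Two arguments: .x comes from the first input, .y from the second
pair_sums <- map2_dbl(1:3, 4:6, ~ .x + .y)

# More than two arguments: ..1, ..2, ..3 refer to them by position
triple_sums <- pmap_dbl(list(1:3, 4:6, 7:9), ~ ..1 + ..2 + ..3)
```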

Note that if an anonymous function passed to map() needs to access other variables, you need to understand its variable scope, that is, the environment in which names are looked up. In addition, expressions involving other variables inside the anonymous function are re-evaluated every time map() applies it to an element of the input. In this case a named function is recommended: the scoping rules are easier to control, and values can be computed once instead of repeatedly. (In short: if you need other variables, avoid the shorthand.)
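One way to follow this advice is to compute the shared value once and let a named function look it up. This is only a sketch; scores, center and deviation are hypothetical names:

```r
library(purrr)

scores <- c(55, 70, 85)

# Computed once, outside the function that is applied to each element
center <- mean(scores)

# Named function: center is found in the environment where deviation was defined
deviation <- function(x) x - center

devs <- map_dbl(scores, deviation)
```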

PS: this can also be done with the apply family, but the code is a bit messier:

> lapply(s, function(x) sum(as.numeric(unlist(strsplit(x, ",")))))
[[1]]
[1] 25

[[2]]
[1] 9

[[3]]
[1] 18

[[4]]
[1] 25

And lapply does not understand the formula shorthand:

> lapply(s, ~ sum(as.numeric(unlist(strsplit(., ",")))))
Error in match.fun(FUN) : 
  '~sum(as.numeric(unlist(strsplit(., ","))))' is not a function, character or symbol

3. Shorthand for extracting list elements

map offers a shorthand not only for anonymous functions, but also for extracting list elements.

Complex data is sometimes represented as a list of lists, where each element is itself a list or vector. This nested structure often appears when converting JSON, YAML and similar formats to R objects. Once imported into R, such data is generally a nested list, that is, each element of the list is also a list.

od <- list(
  list(
    101, name="Li Ming", age=15, 
    hobbies=c("painting", "music")),
  list(
    102, name="Zhang Cong", age=17,
    hobbies=c("Football"),
    birth="2002-10-01")
)

To get the first item of each list element, one would normally write either of:

map_dbl(od, function(x) x[[1]])
## [1] 101 102

map_dbl(od, ~ .[[1]])
## [1] 101 102

The purrr package provides a further simplification. Wherever a function or a "~ expression" is expected, an integer index or a member name can be given instead to extract that component from each list element, such as:

map_dbl(od, 1)
## [1] 101 102

> map_chr(od, "name")
[1] "Li Ming" "Zhang Cong"

We can also pass a list of member positions or member names to drill down level by level:

map_chr(od, list("hobbies", 1))
## [1] "painting" "Football"

This gives the first element of the hobbies member of each list element (everyone's first hobby).

Extracting a non-existent member with map_chr() raises an error, but the .default option specifies the value to use when a member cannot be found, such as:

map_chr(od, "birth", .default=NA)
## [1] NA           "2002-10-01"
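The list path and .default can be combined. A small sketch on a made-up nested list (people and its fields are hypothetical), where the second record lacks the nested member entirely:

```r
library(purrr)

# Hypothetical nested records; the second one has no contact member
people <- list(
  list(name = "A", contact = list(email = "a@example.com")),
  list(name = "B")
)

# Drill into contact, then email; fall back to a character NA when missing
emails <- map_chr(people, list("contact", "email"), .default = NA_character_)
```

Using NA_character_ rather than a bare NA keeps the default the same type as the character result.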

4. Variants of the map function

Variants of map:

map_lgl(): returns a logical vector;
map_int(): returns an integer vector;
map_dbl(): returns a double-precision floating point vector (type double);
map_chr(): returns a character vector.
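Each typed variant can be tried on the same input. A minimal sketch with a made-up list:

```r
library(purrr)

x <- list(1.5, 2, 3.25)

is_large <- map_lgl(x, ~ .x > 2)         # logical vector
lengths  <- map_int(x, ~ length(.x))     # integer vector
doubled  <- map_dbl(x, ~ .x * 2)         # double vector
labels   <- map_chr(x, ~ format(.x))     # character vector
```

If the function returns a value of the wrong type or length for the chosen variant, the call fails instead of silently returning a list.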

In addition, there are other variants of map:

modify(): takes one data argument and a function, and returns a result of the same type as the input data;
map2(): takes two data arguments and a function, applies the function to the elements of the two arguments that share the same index, and returns a list;
imap(): traverses the data together with each element's index (or name);
walk(): takes one data argument and a function, returns no computed result, and is used only for the side effects of the function;
pmap(): takes several data arguments and a function, and applies the function to the elements of the data arguments that share the same index.

By input type, the map functions can be divided into:

One data argument, represented by map();
Two data arguments, represented by map2();
One data argument plus an index variable, represented by imap();
Multiple data arguments, represented by pmap().

Input types and output types are combined with each other. The purrr package provides 27 map-type functions.

modify

modify is used in much the same way as map. Its advantage is that it returns data of the same type as its input: if the input is a data frame, the output is also a data frame:

> d1 <- modify(d2, ~ if(is.numeric(.)) . - median(.) else .)
> d1
  x1 x2 sex
1 -6 -4   M
2  1  8   F
3  2 -1   M
4 -1  1   F

The purrr package also provides a modify_if() function, which modifies only the columns that satisfy a condition, such as:

> d2 <- modify_if(d2, is.numeric, ~ .x - median(.x))
> 
> d2
  x1 x2 sex
1 -6 -4   M
2  1  8   F
3  2 -1   M
4 -1  1   F

That is, a predicate function (one that returns a logical value) is added as an argument.
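Since the input data for the examples above is not shown, here is a self-contained sketch on a made-up data frame (d and its values are hypothetical) that centers only the numeric columns on their medians:

```r
library(purrr)

# Hypothetical data frame with two numeric columns and one character column
d <- data.frame(
  x1  = c(1, 8, 9, 6),
  x2  = c(3, 15, 6, 8),
  sex = c("M", "F", "M", "F"),
  stringsAsFactors = FALSE
)

# Only columns for which is.numeric() is TRUE are transformed
centered <- modify_if(d, is.numeric, ~ .x - median(.x))
```

The result is still a data frame, and the sex column passes through unchanged.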

walk

The walk function returns no computed result (it returns its input invisibly). Sometimes you only need to traverse a data structure and call a function for display or plotting; this is called the side effect of the function, and no result needs to be returned. purrr's walk function addresses this situation. For example, use cat to print the type of each variable in a data frame:

walk(d.class, ~ cat(typeof(.x), "\n"))
## character 
## character 
## double 
## double 
## double

The walk2() function can accept two data arguments, similar to map2(). For example, if you need to save a group of data to a file separately, you can use the data list and the character vector of the saved file name as the two data arguments of walk2().

dl <- split(d.class, d.class[["sex"]])
walk2(dl, paste0("class-", names(dl), ".csv"), 
      ~ write.csv(.x, file=.y))

With the pipe operator this reads more naturally:

d.class %>%
  split(d.class[["sex"]]) %>%
  walk2(paste0("class-", names(.), ".csv"), ~ write.csv(.x, file=.y))

PS: walk is very handy for batch operations and saving files, since it spares you from writing loops, and base R does not provide a function similar to walk.
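A runnable version of the save-by-group idea, sketched with the built-in mtcars data and a temporary directory so nothing is written to the working directory (the file names are made up):

```r
library(purrr)

# Temporary output directory, so the example leaves the working directory alone
out_dir <- tempfile("walk-demo-")
dir.create(out_dir)

# One data frame per cylinder count
groups <- split(mtcars, mtcars$cyl)
paths  <- file.path(out_dir, paste0("cyl-", names(groups), ".csv"))

# Called only for its side effect: one CSV file per group
walk2(groups, paths, ~ write.csv(.x, file = .y, row.names = FALSE))

written <- file.exists(paths)
```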

iwalk/imap

This family of functions can access the index or element name and the element value at the same time. Each step of the traversal yields two variables: the element value and the element index (or the element name, if there is one). If x has element names, imap(x, f) is equivalent to map2(x, names(x), f); if x has no element names, imap(x, f) is equivalent to map2(x, seq_along(x), f).

For example, display the name and type of each column of the data frame:

iwalk(d.class, ~ cat(.y, ": ", typeof(.x), "\n"))
## name :  character 
## sex :  character 
## age :  double 
## height :  double 
## weight :  double
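The equivalence between imap and map2 can be checked directly. A small sketch with made-up vectors:

```r
library(purrr)

scores <- c(math = 90, art = 75)

# Named input: .y is the element name
via_imap <- imap_chr(scores, ~ paste0(.y, "=", .x))
via_map2 <- map2_chr(scores, names(scores), ~ paste0(.y, "=", .x))

# Unnamed input: .y is the position
by_position <- imap_chr(c(10, 20), ~ paste0(.y, ":", .x))
```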

pmap

R's vectorization handles the case where each argument is a vector well, but it cannot automatically vectorize over several lists or data frames at once. The pmap family in purrr supports iterating in parallel over multiple lists, data frames, vectors and so on. Instead of taking the lists as separate arguments, pmap takes them packed into a single list. Therefore map2(x, y, f) is written with pmap() as pmap(list(x, y), f).

It is equivalent to a map that traverses several inputs at once:

x <- list(101, name="Li Ming")
y <- list(102, name="Zhang Cong")
z <- list(103, name="kingdom")
pmap(list(x, y, z), c)
## [[1]]
## [1] 101 102 103
## 
## $name
## [1] "Li Ming"    "Zhang Cong" "kingdom"

For a data frame, pmap executes the function on each row (whereas map works on columns; together they resemble apply with its choice of rows or columns). For example:

d <- tibble::tibble(
  x = 101:103, 
  y=c("Li Ming", "Zhang Cong", "kingdom"))
pmap_chr(d, function(...) paste(..., sep=":"))
         
## [1] "101:Li Ming"    "102:Zhang Cong" "103:kingdom"
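When the function's argument names match the column names, pmap passes each row's values by name, which can be easier to read than `...`. A sketch with a made-up data frame (the id and name columns are hypothetical):

```r
library(purrr)

d <- data.frame(
  id   = 101:103,
  name = c("Li Ming", "Zhang Cong", "Wang Guo"),
  stringsAsFactors = FALSE
)

# Argument names match the column names, so values are matched by name per row
labels <- pmap_chr(d, function(id, name) paste(id, name, sep = ":"))
```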

5. reduce-type functions

reduce(1:4, `+`)
## [1] 10

The execution is actually ((1 + 2) + 3) + 4.

Although the result here matches sum, reduce can also combine, item by item, a list whose elements are complex.

For example, if you want to take the intersection of the following data:

set.seed(5)
x <- replicate(4, sample(
  1:5, size=5, replace=TRUE), simplify=FALSE); x
## [[1]]
## [1] 2 3 1 3 1
## 
## [[2]]
## [1] 1 5 3 3 2
## 
## [[3]]
## [1] 5 4 2 5 3
## 
## [[4]]
## [1] 1 4 3 2 5

Doing this with pipes is very cumbersome:

x[[1]] %>% intersect(x[[2]]) %>% intersect(x[[3]]) %>% intersect(x[[4]])

With reduce it takes just one call:

reduce(x, intersect)
## [1] 2 3

PS: reduce() supports the ... parameter, so you can pass additional arguments or options to the function being called. So could this replace something like nested ifelse calls with extra arguments? For complex content, you would not need layer upon layer of nesting.
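As an illustration of passing extra arguments through reduce(), the sketch below merges three made-up data frames on a shared key; the tables and the id column are hypothetical:

```r
library(purrr)

a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(id = 2:4, y = c("p", "q", "r"), stringsAsFactors = FALSE)
d <- data.frame(id = 1:3, z = c(TRUE, FALSE, TRUE))

# by = "id" is forwarded to every merge() call:
# merge(merge(a, b, by = "id"), d, by = "id")
merged <- reduce(list(a, b, d), merge, by = "id")
```

The same call can be written with an explicit anonymous function, ~ merge(.x, .y, by = "id"), if you prefer not to rely on argument forwarding.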

reduce2

In reduce2(x, y, f), x is the data list or vector that is reduced step by step, and y supplies a different extra argument for each step. Without an .init initial value, f is called only length(x)-1 times, so y needs only length(x)-1 elements; with an .init initial value, f is called length(x) times, and y must be as long as x.
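The length rule can be seen in a short sketch, where a hypothetical helper paste2 joins strings with a separator taken from the second input:

```r
library(purrr)

# Hypothetical helper: the third argument comes from y at each step
paste2 <- function(acc, nxt, sep) paste(acc, nxt, sep = sep)

# Without .init: 4 elements in x, so 3 calls and 3 separators
no_init <- reduce2(c("a", "b", "c", "d"), c("-", ".", "-"), paste2)

# With .init: 2 elements in x, so 2 calls and 2 separators
with_init <- reduce2(c("b", "c"), c("-", "."), paste2, .init = "a")
```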

accumulate

accumulate() is to reduce() what cumsum() is to sum(): it returns the result of every intermediate step:

accumulate(x, union)
## [[1]]
## [1] 2 3 1 3 1
## 
## [[2]]
## [1] 2 3 1 5
## 
## [[3]]
## [1] 2 3 1 5 4
## 
## [[4]]
## [1] 2 3 1 5 4
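The analogy with cumsum can be checked on a plain numeric example:

```r
library(purrr)

# Every intermediate result of 1 + 2 + 3 + 4 + 5 is kept
running <- accumulate(1:5, `+`)

# reduce() keeps only the last one
final <- reduce(1:5, `+`)
```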

Map reduce algorithm

Map-reduce is an important algorithm in big data technology, used mainly in distributed systems such as Hadoop. The data are stored on different compute nodes; the required operation is mapped to each node, where the information is extracted and condensed; finally the results from the different nodes are combined using the idea of reduce.
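The same pattern can be imitated on a single machine. This is a toy sketch, not a distributed implementation, in which the chunks play the role of compute nodes:

```r
library(purrr)

# Split the data into 4 chunks, standing in for 4 nodes
chunks <- split(1:100, rep(1:4, each = 25))

# Map step: each "node" condenses its chunk to a partial sum
partial_sums <- map_dbl(chunks, sum)

# Reduce step: combine the partial results into one total
total <- reduce(partial_sums, `+`)
```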

6. Functionals that use predicate functions

some

some(.x, .p) tests each element of the data list or vector .x with the predicate .p; the result is TRUE as long as at least one element tests TRUE. every(.x, .p) is similar to some, but the result is TRUE only if all elements test TRUE. These functions are similar to any(map_lgl(.x, .p)) and all(map_lgl(.x, .p)). (They are a more flexible any or all.)

> d1
# A tibble: 4 x 2
     x1    x2
  <dbl> <dbl>
1   106   101
2   108   112
3   103   107
4   110   105
> some(d1, is.numeric)
[1] TRUE
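A short sketch comparing some and every with their any/all counterparts, on made-up lists:

```r
library(purrr)

mixed   <- list(1, "a", TRUE)
numbers <- list(1, 2, 3)

has_character <- some(mixed, is.character)    # at least one element passes
all_numeric   <- every(numbers, is.numeric)   # every element passes
mixed_numeric <- every(mixed, is.numeric)     # one failure is enough for FALSE

# The map_lgl() formulations give the same answers
same_some  <- any(map_lgl(mixed, is.character))
same_every <- all(map_lgl(numbers, is.numeric))
```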

detect

detect(.x, .p) returns the value of the first element of .x for which .p is TRUE, while detect_index(.x, .p) returns the index of that first TRUE element.

Return the value of the first element in the vector that is at least 100:

detect(c(1, 5, 77, 105, 99, 123), ~ . >= 100)
## [1] 105

Return the index of the first element in the vector that is at least 100:

detect_index(c(1, 5, 77, 105, 99, 123),
  ~ . >= 100)
## [1] 4
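Two details are worth a quick check: what happens when nothing matches, and searching from the end with the .dir argument. A sketch on the same vector:

```r
library(purrr)

v <- c(1, 5, 77, 105, 99, 123)

# No match: detect() gives NULL and detect_index() gives 0
none_value <- detect(v, ~ .x >= 1000)
none_index <- detect_index(v, ~ .x >= 1000)

# Search from the end of the vector instead of the start
last_value <- detect(v, ~ .x >= 100, .dir = "backward")
last_index <- detect_index(v, ~ .x >= 100, .dir = "backward")
```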

keep/discard

keep(.x, .p) returns the subset of the elements of .x for which .p is TRUE; discard(.x, .p) returns the subset of elements that do not satisfy the condition.
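keep and discard split a vector into two complementary parts, as this small sketch shows:

```r
library(purrr)

evens <- keep(1:10, ~ .x %% 2 == 0)      # elements where the predicate is TRUE
odds  <- discard(1:10, ~ .x %% 2 == 0)   # elements where it is FALSE
```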

x. Other useful functions

keep can be used to select the columns of a data frame, or the elements of a list, that satisfy a condition given by a function returning a logical value. For example:

> keep(infos, is.character)
# A tibble: 4 x 2
  family name 
  <chr>  <chr>
1 Zhang  three
2 Lee    four 
3 king   five 
4 Zhao   six  

In fact, other vectorized approaches work just as well:

> tmp2 = unlist(map(infos, typeof)) %in% "character"
> infos[,tmp2]
# A tibble: 4 x 2
  family name 
  <chr>  <chr>
1 Zhang  three
2 Lee    four 
3 king   five 
4 Zhao   six  
