「R」 dplyr column-wise operations

Posted by Cascade on Sun, 23 Jan 2022 07:05:46 +0100

❝ Recently, while using "dplyr" for operations on selected columns, such as mutate_at(), I found that the documentation marks a whole family of "dplyr" function variants as superseded. They appear to be on their way out, with across() as their unified replacement. So I took some time to study and translate the relevant material, hoping it will be of some help. This article is the first part and covers column-wise operations; a follow-up will cover row-wise data processing. The original text comes from the dplyr documentation (Column-wise operations • dplyr, tidyverse.org), 2021-01. ❞

It is often useful to apply the same operation to several columns of a data frame at once, but doing so by copying and pasting is tedious and error-prone.

For example:

df %>% 
  group_by(g1, g2) %>% 
  summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(d))

(If you want to compute the mean of a, b, c, and d for each row, see the article on row-wise operations.)

This article introduces the across() function, which lets you rewrite the code above more concisely:

df %>% 
  group_by(g1, g2) %>% 
  summarise(across(a:d, mean))

We'll start with the basic usage of across(), focusing on its use with summarise() and showing how to combine it with multiple functions. Then we will show how it works with some other verbs. Finally, we will briefly cover the history, explaining why we prefer across() to the earlier approach (the _if(), _at(), and _all() function variants) and how to translate your old code to the new syntax.

Load package:

library(dplyr, warn.conflicts = FALSE)

Basic Usage

across() has two main parameters:

  • The first parameter, .cols, selects the columns you want to operate on. It uses tidy selection syntax (like select()), so you can pick variables by position, name, and type.
  • The second parameter, .fns, is a function or a list of functions to apply to each column. It can also be a "purrr"-style formula such as ~ .x / 2.
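As a small illustration of the two parameters, the sketch below selects the same columns in three ways (by name, by position, and by type) and halves them with a "purrr"-style formula. The data frame and its column names are made up for this example.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: one character column and two numeric columns
df <- tibble(id = c("a", "b"), x = c(2, 4), y = c(10, 20))

# .cols by name, .fns as a purrr-style formula
by_name <- df %>% mutate(across(c(x, y), ~ .x / 2))

# .cols by position (columns 2 and 3)
by_position <- df %>% mutate(across(2:3, ~ .x / 2))

# .cols by type
by_type <- df %>% mutate(across(where(is.numeric), ~ .x / 2))
```

All three calls should produce the same result here, because each selection resolves to the columns x and y.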

Here are some examples of across() combined with its favorite verb, summarise(). You can also use across() with any other "dplyr" verb, as we will see later.

starwars %>% 
  summarise(across(where(is.character), ~ length(unique(.x))))
#> # A tibble: 1 x 8
#>    name hair_color skin_color eye_color   sex gender homeworld species
#>   <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
#> 1    87         13         31        15     5      3        49      38

starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>% 
  summarise(across(c(sex, gender, homeworld), ~ length(unique(.x))))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 9 x 4
#>   species    sex gender homeworld
#>   <chr>    <int>  <int>     <int>
#> 1 Droid        1      2         3
#> 2 Gungan       1      1         1
#> 3 Human        2      2        16
#> 4 Kaminoan     2      2         1
#> # ... with 5 more rows

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 10 x 4
#>   homeworld height  mass birth_year
#>   <chr>      <dbl> <dbl>      <dbl>
#> 1 Alderaan    176.  64         43  
#> 2 Corellia    175   78.5       25  
#> 3 Coruscant   174.  50         91  
#> 4 Kamino      208.  83.1       31.5
#> # ... with 6 more rows

Because across() is usually used together with summarise() and mutate(), it does not select grouping variables, to avoid accidentally modifying them:

df <- data.frame(g = c(1, 1, 2), x = c(-1, 1, 3), y = c(-1, -4, -9))
df %>% 
  group_by(g) %>% 
  summarise(across(where(is.numeric), sum))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     1     0    -5
#> 2     2     3    -9

Multiple functions

You can apply several functions to each variable at once by supplying a named list of functions (or lambda functions) as the second parameter:

min_max <- list(
  min = ~min(.x, na.rm = TRUE), 
  max = ~max(.x, na.rm = TRUE)
)
starwars %>% summarise(across(where(is.numeric), min_max))
#> # A tibble: 1 x 6
#>   height_min height_max mass_min mass_max birth_year_min birth_year_max
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896

You can control how the new columns are named by passing a glue[1] specification to the .names parameter:

starwars %>% summarise(across(where(is.numeric), min_max, .names = "{.fn}.{.col}"))
#> # A tibble: 1 x 6
#>   min.height max.height min.mass max.mass min.birth_year max.birth_year
#>        <int>      <int>    <dbl>    <dbl>          <dbl>          <dbl>
#> 1         66        264       15     1358              8            896

If you would rather keep all summaries from the same function together (that is, all min results on the left and all max results on the right), you have to expand the calls yourself:

starwars %>% summarise(
  across(where(is.numeric), ~min(.x, na.rm = TRUE), .names = "min_{.col}"),
  across(where(is.numeric), ~max(.x, na.rm = TRUE), .names = "max_{.col}")
)
#> # A tibble: 1 x 9
#>   min_height min_mass min_birth_year max_height max_mass max_birth_year
#>        <int>    <dbl>          <dbl>      <int>    <dbl>          <dbl>
#> 1         66       15              8        264     1358            896
#> # ... with 3 more variables: max_min_height <int>, max_min_mass <dbl>,
#> #   max_min_birth_year <dbl>

(This may one day be supported by a parameter of across(), but we haven't found a solution yet.)
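The output above also contains max_min_* columns. This is most likely because the second across() call sees the min_* columns created by the first one, and they are numeric too. One way to avoid this, sketched below, is to name the columns explicitly instead of using where(is.numeric). The helper vector num_cols is introduced only for this illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: the numeric columns of starwars, listed by name
num_cols <- c("height", "mass", "birth_year")

out <- starwars %>% summarise(
  # all minima first ...
  across(all_of(num_cols), ~ min(.x, na.rm = TRUE), .names = "min_{.col}"),
  # ... then all maxima; the min_* columns are not selected again
  across(all_of(num_cols), ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
)

names(out)
```

With an explicit selection, the result should contain only the three min_* columns followed by the three max_* columns.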

Current column

If you need to, you can call cur_column() to get the name of the current column. This can be useful when the transformation depends on which column is being processed:

df <- tibble(x = 1:3, y = 3:5, z = 5:7)
mult <- list(x = 1, y = 10, z = 100)

# Multiply each column of df by the corresponding value in mult
df %>% mutate(across(all_of(names(mult)), ~ .x * mult[[cur_column()]]))
#> # A tibble: 3 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1     1    30   500
#> 2     2    40   600
#> 3     3    50   700

Gotchas

Be careful when combining where(is.numeric) with numeric summaries:

df <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))

df %>% 
  summarise(n = n(), across(where(is.numeric), sd))
#>    n x        y
#> 1 NA 1 4.041452

Here, n becomes NA because n is numeric, so across() computes its standard deviation as well, and the standard deviation of a single value (3) is NA. You can solve this by computing n() last:

df %>% 
  summarise(across(where(is.numeric), sd), n = n())
#>   x        y n
#> 1 1 4.041452 3

Alternatively, you can explicitly exclude n from the selection:

df %>% 
  summarise(n = n(), across(where(is.numeric) & !n, sd))
#>   n x        y
#> 1 3 1 4.041452

Other verbs

So far, we have focused on combining across() with summarise(), but it also works with other "dplyr" verbs:

Rescale all numeric variables to the range 0-1:

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
df <- tibble(x = 1:4, y = rnorm(4))
df %>% mutate(across(where(is.numeric), rescale01))
#> # A tibble: 4 x 2
#>       x     y
#>   <dbl> <dbl>
#> 1 0     0.385
#> 2 0.333 1    
#> 3 0.667 0    
#> 4 1     0.903

Find all rows in which no variable has a missing value:

starwars %>% filter(across(everything(), ~ !is.na(.x)))
#> # A tibble: 29 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke...    172    77 blond      fair       blue            19   male  mascu...
#> 2 Dart...    202   136 none       white      yellow          41.9 male  mascu...
#> 3 Leia...    150    49 brown      light      brown           19   fema... femin...
#> 4 Owen...    178   120 brown, gr... light      blue            52   male  mascu...
#> # ... with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

For some verbs, such as group_by(), count(), and distinct(), you can omit the summary function:

Find all unique values:

starwars %>% distinct(across(contains("color")))
#> # A tibble: 67 x 3
#>   hair_color skin_color  eye_color
#>   <chr>      <chr>       <chr>    
#> 1 blond      fair        blue     
#> 2 <NA>       gold        yellow   
#> 3 <NA>       white, blue red      
#> 4 none       white       yellow   
#> # ... with 63 more rows

Count all combinations of the variables that match a given pattern:

starwars %>% count(across(contains("color")), sort = TRUE)
#> # A tibble: 67 x 4
#>   hair_color skin_color eye_color     n
#>   <chr>      <chr>      <chr>     <int>
#> 1 brown      light      brown         6
#> 2 brown      fair       blue          4
#> 3 none       grey       black         4
#> 4 black      dark       brown         3
#> # ... with 63 more rows

across() does not work with select() or rename(), because those two functions already use tidy selection syntax. If you want to transform column names with a function, you can use rename_with().
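A minimal sketch of rename_with(), assuming a made-up data frame with mixed-case column names. The renaming function comes first; an optional tidy selection limits which columns are renamed.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data with mixed-case column names
df <- tibble(Height = 1, Mass = 2, name = "a")

# Apply a function to every column name
upper_all <- df %>% rename_with(toupper)

# Apply a function only to the names of the numeric columns
lower_num <- df %>% rename_with(tolower, where(is.numeric))

names(upper_all)
names(lower_num)
```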

_if, _at, _all

Previous versions of "dplyr" offered a different way to apply functions to multiple columns: functions with the _if, _at, and _all suffixes. These functions solved a pressing need and are used by many people, but they are now superseded. This means that they will stay around, but will not receive any new features and will only get critical bug fixes.

Why do we like across()?

Why did we decide to move from the functions above to across()? The reasons are as follows.

across() makes it possible to express useful summaries that could not be expressed before:

df %>%
  group_by(g1, g2) %>% 
  summarise(
    across(where(is.numeric), mean), 
    across(where(is.factor), nlevels),
    n = n(), 
  )
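For a version you can run, here is a sketch of the same pattern on a small made-up data frame (the column names g, num, and f are assumptions for illustration). Numeric columns are averaged, factor columns are reduced to their number of levels, and the group size is added last.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: a grouping column, a numeric column, a factor column
df <- tibble(
  g   = c("a", "a", "b"),
  num = c(1, 3, 5),
  f   = factor(c("u", "v", "u"))
)

out <- df %>%
  group_by(g) %>%
  summarise(
    across(where(is.numeric), mean),    # mean of each numeric column
    across(where(is.factor), nlevels),  # number of levels of each factor
    n = n()                             # group size, computed last
  )
```

Computing n() last keeps it out of the where(is.numeric) selection.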

across() reduces the number of functions that "dplyr" needs to provide. This makes "dplyr" easier to use (because there are fewer functions to remember) and makes it easier for us to implement new verbs (because we only need to implement one function instead of four).

across() unifies the semantics of _if and _at, so you can select variables by position, name, and type, and even combine these freely, which was impossible before. For example, you can now transform all numeric columns whose names start with "x": across(where(is.numeric) & starts_with("x")).
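The combined selection can be tried on a small made-up data frame. In this sketch (column names are assumptions), only x1 is both numeric and named with a leading "x", so it should be the only column that changes.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data: x1 is numeric, x2 is character, y1 is numeric
df <- tibble(x1 = c(1, 2), x2 = c("a", "b"), y1 = c(3, 4))

# Select by type AND by name, then multiply the matching columns by 10
out <- df %>%
  mutate(across(where(is.numeric) & starts_with("x"), ~ .x * 10))
```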

across() does not require vars(). The _at() functions are the only place in "dplyr" where you have to quote variable names manually, which makes them odd and hard to remember.

Why did it take so long to find across()?

Disappointingly, we did not discover across() earlier, and instead went through several false starts (first not realizing that this was a common problem, then the _each() functions, and finally the _if()/_at()/_all() functions). However, across() could not have been developed without three recent discoveries:

  • You can have a column of a data frame that is itself a data frame. This is provided by base R but is not well documented, and it took us a while to see that it was useful and not just a theoretical curiosity.
  • We can use data frames to let summary functions return multiple columns.
  • We can use the absence of an outer name as a convention for unpacking a data frame column into individual columns.
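These three points can be seen in a small sketch, assuming dplyr 1.0.0 or later. The helper range_df() and the toy data are made up for illustration: the helper returns a one-row data frame, which summarise() unpacks when it is left unnamed and keeps as a single data-frame column when it is given a name.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data
df <- tibble(g = c(1, 1, 2), x = c(1, 3, 5))

# Hypothetical helper: a summary that returns two columns as a data frame
range_df <- function(v) tibble(lo = min(v), hi = max(v))

# No outer name: the data frame is unpacked into the columns lo and hi
unpacked <- df %>% group_by(g) %>% summarise(range_df(x))

# With an outer name: the result is one data-frame column called r
packed <- df %>% group_by(g) %>% summarise(r = range_df(x))

names(unpacked)
names(packed)
```

across() returns a data frame in the same way, which is why its results appear as ordinary columns when the call is not named.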

How do you convert existing code?

Fortunately, converting existing code to use across() is usually straightforward:

  • Strip the _if(), _at(), or _all() suffix off the function name.
  • Call across() with the first parameter chosen as follows. Any other parameters can be kept as they are.
    1. For _if(), the original second parameter wrapped in where()
    2. For _at(), the original second parameter, with the vars() wrapper removed if there is one
    3. For _all(), everything()

For example:

df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(where(is.numeric), mean, na.rm = TRUE))

df %>% mutate_at(vars(c(x, starts_with("y"))), mean)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean))

df %>% mutate_all(mean)
# ->
df %>% mutate(across(everything(), mean))

There are some exceptions to this rule:

rename_*() and select_*() follow a different pattern. They already have selection semantics, so they are generally used differently from across(). Use the new rename_with() instead.

Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(). There is no direct replacement for any_vars(), but you can create one yourself:

df <- tibble(x = c("a", "b"), y = c(1, 1), z = c(-1, 1))

# Find all rows where every numeric column is greater than 0
df %>% filter(across(where(is.numeric), ~ .x > 0))
#> # A tibble: 1 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 b         1     1

# Find all rows where any numeric column is greater than 0
rowAny <- function(x) rowSums(x) > 0
df %>% filter(rowAny(across(where(is.numeric), ~ .x > 0)))
#> # A tibble: 2 x 3
#>   x         y     z
#>   <chr> <dbl> <dbl>
#> 1 a         1    -1
#> 2 b         1     1
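Later "dplyr" releases (1.0.4 and newer, as far as I know) added if_all() and if_any() for this purpose, so a hand-written helper like rowAny() may no longer be needed. The sketch below assumes such a version is installed and reuses the same data.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c("a", "b"), y = c(1, 1), z = c(-1, 1))

# Rows where every numeric column is greater than 0
all_pos <- df %>% filter(if_all(where(is.numeric), ~ .x > 0))

# Rows where at least one numeric column is greater than 0
any_pos <- df %>% filter(if_any(where(is.numeric), ~ .x > 0))
```

Here all_pos should keep only row "b", while any_pos should keep both rows.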

When used in mutate(), all transformations performed by across() are applied at once. This differs from mutate_if(), mutate_at(), and mutate_all(), which apply the transformations one column at a time. We expect that you will find the new behavior less surprising:

df <- tibble(x = 2, y = 4, z = 8)
df %>% mutate_all(~ .x / y)
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     8

df %>% mutate(across(everything(), ~ .x / y))
#> # A tibble: 1 x 3
#>       x     y     z
#>   <dbl> <dbl> <dbl>
#> 1   0.5     1     2

Summary

With across(), the developers of "dplyr" have simplified the logic for some complex data operations, made the package more efficient to learn and use, and let us as users focus on the logic of our analysis and not on implementation details. Well done!

Reference

[1] glue: https://glue.tidyverse.org/