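Flattening columns into arguments

Time: 2017-09-16 06:52:39

Tags: r dplyr

When creating data summaries with dplyr, I often find myself computing confidence intervals (using CI from Rmisc):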

summary <- data %>%
  group_by(group1, group2) %>%
  summarize(
    var1.mean = CI(var1, ci=0.95)['mean'],
    var1.lower = CI(var1, ci=0.95)['lower'],
    var1.upper = CI(var1, ci=0.95)['upper'],

    var2.mean = CI(var2, ci=0.95)['mean'],
    var2.lower = CI(var2, ci=0.95)['lower'],
    var2.upper = CI(var2, ci=0.95)['upper'],

    var3.mean = CI(var3, ci=0.95)['mean'],
    var3.lower = CI(var3, ci=0.95)['lower'],
    var3.upper = CI(var3, ci=0.95)['upper'],

    var4 = sum(var4)
  )

This is painful, verbose, and inefficient. Ideally, I would like to be able to write something like this:

summary <- data %>%
  group_by(group1, group2) %>%
  summarize(
    var1 = CI(var1, ci=0.95),
    var2 = CI(var2, ci=0.95),
    var3 = CI(var3, ci=0.95),
    var4 = sum(var4)
  )

For the code above, CI returns a named vector with the elements:
  • "lower"
  • "upper"
  • "mean"
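
For reference, the shape of that return value can be sketched in base R. The function ci_sketch below is a hypothetical stand-in written for illustration, not the Rmisc source; it assumes a t-based interval around the mean and the upper/mean/lower ordering that appears in the answer outputs further down.

```r
# Hypothetical stand-in for Rmisc::CI (an assumption, not the package code):
# a t-based confidence interval returned as a named numeric vector.
ci_sketch <- function(x, ci = 0.95) {
  m <- mean(x)
  n <- length(x)
  # half-width of the interval from the t distribution with n - 1 df
  err <- qt((1 + ci) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(upper = m + err, mean = m, lower = m - err)
}

res <- ci_sketch(c(2, 10))
names(res)   # the three names that the question indexes with ['mean'] etc.
res["mean"]  # a single element can be picked out by name
```

Because the result is a length-3 vector rather than a single value, it cannot be assigned directly to one summary column, which is what the question is about.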

I was hoping to get a data frame containing the following columns:

  • "group1"
  • "group2"
  • "var1.lower"
  • "var1.mean"
  • "var1.upper"
  • "var2.lower"
  • ...
  • "var3.upper"
  • "var4"

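Any idea how to achieve this? Is there a way to "flatten" columns in R? Something like do.call, but applied to the rest of the arguments, as in JS or Python?

Quasiquotation might have something to do with it, but that is starting to go beyond my R skills.

I used to use this gist with plyr, but it no longer works with dplyr, and rather than recoding it yet again, I am hoping there is a better way than hacking the library.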

4 Answers:

Answer 0 (score: 3)

If we first format the output as a data.frame, we can use tidyr::unnest.

Data

library(Rmisc)
library(dplyr)
library(tidyr)
set.seed(1)
data <- data.frame(group1 = sample(c("A","B"),10,T),
                   group2 = sample(c("A","B"),10,T),
                   var1 = sample(10),
                   var2 = sample(10),
                   var3 = sample(10),
                   var4 = sample(10))

General solution

data %>% group_by(group1, group2) %>%
  dplyr::summarize(var1 = list(data.frame(t(CI(var1, ci=0.95)))),
                   var2 = list(data.frame(t(CI(var2, ci=0.95)))),
                   var3 = list(data.frame(t(CI(var3, ci=0.95)))),
                   var4 = sum(var4)) %>%
  unnest (var1,var2,var3,.sep=".")

Result

# A tibble: 4 x 12
# Groups:   group1 [2]
#   group1 group2  var4 var1.upper var1.mean var1.lower var2.upper var2.mean  var2.lower var3.upper var3.mean var3.lower
#   <fctr> <fctr> <int>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>       <dbl>      <dbl>     <dbl>      <dbl>
# 1      A      A    13  56.824819       6.0 -44.824819   11.85310  5.500000  -0.8531024   26.55931  7.500000 -11.559307
# 2      A      B    11  38.265512       6.5 -25.265512   50.97172  6.500000 -37.9717166   25.55931  6.500000 -12.559307
# 3      B      A    11  12.956686       4.0  -4.956686   13.65205  5.666667  -2.3187188   15.07146  5.666667  -3.738127
# 4      B      B    20   8.484138       6.0   3.515862   14.70619  4.666667  -5.3728564   11.31872  3.333333  -4.652052

Or using a custom CI function (same output)

CI2 <- function(x,ci=0.95) list(data.frame(t(CI(x, ci))))

data %>% group_by(group1, group2) %>%
  dplyr::summarize(var1 = CI2(var1, ci=0.95),
                   var2 = CI2(var2, ci=0.95),
                   var3 = CI2(var3, ci=0.95),
                   var4 = sum(var4)) %>%
  unnest (var1,var2,var3,.sep=".")

Or using a converter function (same output)

This can be used with any other function that returns a vector

vec2rowdf <- function(v) list(data.frame(t(v))) # creates a 1 row data.frame from a vector, wrapped in a list
data %>% group_by(group1, group2) %>%
  dplyr::summarize(var1 = CI(var1, ci=0.95) %>% vec2rowdf,
                   var2 = CI(var2, ci=0.95) %>% vec2rowdf,
                   var3 = CI(var3, ci=0.95) %>% vec2rowdf,
                   var4 = sum(var4)) %>%
  unnest (var1,var2,var3,.sep=".")
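
To see what the converter actually produces, here is a minimal base-R sketch. It reuses vec2rowdf from above; the input vector and the names min and max are illustrative assumptions, chosen only to show that the helper works for any named vector and not just for CI.

```r
# Same helper as in the answer: a 1-row data.frame wrapped in a list
vec2rowdf <- function(v) list(data.frame(t(v)))

x <- c(2, 4, 4, 6, 9)

# Any named vector works; here we build one by hand (illustrative names)
wrapped <- vec2rowdf(c(min = min(x), max = max(x)))

# The outer list has length 1, so it fits in a single summary cell
length(wrapped)

# The inner element is a data.frame with one row and one column per name
row <- wrapped[[1]]
names(row)
nrow(row)
```

The outer list is what lets summarize store the result as a single value per group; the one-row data.frame inside is what unnest later spreads into separate columns.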

Answer 1 (score: 1)

"Flattening" is handled by unnest (from tidyr). You can create list columns while summarizing and then use unnest to expand them, so that each upper/mean/lower triplet appears as 3 rows:

res <- data %>% 
    group_by(group1, group2) %>% 
    summarize_at(vars(var1, var2, var3), funs(list(CI(., ci=0.95)))) %>%
    unnest(var1, var2, var3, .id = 'name')

Then you need to use gather and spread to reshape your tibble into the desired output:

res %>%
    group_by(group1, group2) %>%
    mutate(q = c('upper', 'mean', 'lower')) %>%
    ungroup %>%
    gather(var, val, var1, var2, var3) %>%
    mutate(var = paste(var, q, sep = '.')) %>%
    select(-q) %>%
    spread(var, val)

It is easy to change this to use other functions in the summarize_* family (applying different functions to different columns requires switching from summarize_at to summarize and writing out the expression for each of var1, var2, ... explicitly).

Answer 2 (score: 1)

Here is another possible approach: after summarizing var1 to var4, use unlist row by row

library(dplyr)

summary <- dat %>%
  group_by(group1, group2) %>%
  summarize(
    var1 = list(Rmisc::CI(var1)),
    var2 = list(Rmisc::CI(var2)),
    var3 = list(Rmisc::CI(var3)),
    var4 = sum(var4)) %>%
  ungroup()

summary <- t(apply(summary, 1, unlist)) %>%     # unlist by row (this results in a character matrix)
  as.data.frame(stringsAsFactors = F) %>% # convert matrix back to data frame
  mutate_at(vars(var1.upper:var4),        # convert appropriate columns back to numeric
            as.numeric)

> str(summary)
'data.frame':   9 obs. of  12 variables:
 $ group1    : chr  "A" "A" "A" "B" ...
 $ group2    : chr  "d" "e" "f" "d" ...
 $ var1.upper: num  5.77 6.25 5.94 6.49 6.26 ...
 $ var1.mean : num  4.55 4.8 4.66 5.27 4.94 ...
 $ var1.lower: num  3.32 3.35 3.38 4.04 3.62 ...
 $ var2.upper: num  20.5 20.3 20.3 20.1 20.3 ...
 $ var2.mean : num  20.2 19.9 19.9 19.7 19.9 ...
 $ var2.lower: num  19.8 19.5 19.5 19.3 19.5 ...
 $ var3.upper: num  140 155 120 113 141 ...
 $ var3.mean : num  113.9 125.3 94.7 86.3 111.6 ...
 $ var3.lower: num  88.1 95.6 69.9 59.8 82.7 ...
 $ var4      : num  240042 205052 215986 231008 229010 ...

Sample data used:

set.seed(123)
n = 2000
dat <- data.frame(
  group1 = sample(LETTERS[1:3], n, replace = T),
  group2 = sample(letters[4:6], n, replace = T),
  var1 = rnorm(n, mean = 5, sd = 10),
  var2 = rnorm(n, mean = 20, sd = 3),
  var3 = rnorm(n, mean = 100, sd = 200),
  var4 = rnorm(n, mean = 1000, sd = 5)
)

Answer 3 (score: 1)

Here are some more concise solutions using tidyverse tools

set.seed(1)
data <- data.frame(group1 = sample(c("A","B"),10,T),
                   group2 = sample(c("A","B"),10,T),
                   var1 = sample(10),
                   var2 = sample(10),
                   var3 = sample(10),
                   var4 = sample(10))

library(dplyr, warn.conflicts = F)

First solution

Use the summarise_at function by creating a list of functions and applying it to a subset of the columns

summarise_fun <- funs(
  mean = Rmisc::CI(., ci=0.95)['mean'],
  lower = Rmisc::CI(., ci=0.95)['lower'],
  upper = Rmisc::CI(., ci=0.95)['upper'])


summary_CI <- data %>%
  group_by(group1, group2) %>%
  summarize_at(vars(num_range("var", 1:3)), summarise_fun)
summary_CI
#> # A tibble: 4 x 11
#> # Groups:   group1 [?]
#>   group1 group2 var1_mean var2_mean var3_mean var1_lower  var2_lower
#>   <fctr> <fctr>     <dbl>     <dbl>     <dbl>      <dbl>       <dbl>
#> 1      A      A       6.0  5.500000  7.500000 -44.824819  -0.8531024
#> 2      A      B       6.5  6.500000  6.500000 -25.265512 -37.9717166
#> 3      B      A       4.0  5.666667  5.666667  -4.956686  -2.3187188
#> 4      B      B       6.0  4.666667  3.333333   3.515862  -5.3728564
#> # ... with 4 more variables: var3_lower <dbl>, var1_upper <dbl>,
#> #   var2_upper <dbl>, var3_upper <dbl>

You can then join this with a separate summary of var4

inner_join(summary_CI, 
          data %>%
            group_by(group1, group2) %>%
            summarize_at("var4", sum),
          by = c("group1", "group2"))
#> # A tibble: 4 x 12
#> # Groups:   group1 [?]
#>   group1 group2 var1_mean var2_mean var3_mean var1_lower  var2_lower
#>   <fctr> <fctr>     <dbl>     <dbl>     <dbl>      <dbl>       <dbl>
#> 1      A      A       6.0  5.500000  7.500000 -44.824819  -0.8531024
#> 2      A      B       6.5  6.500000  6.500000 -25.265512 -37.9717166
#> 3      B      A       4.0  5.666667  5.666667  -4.956686  -2.3187188
#> 4      B      B       6.0  4.666667  3.333333   3.515862  -5.3728564
#> # ... with 5 more variables: var3_lower <dbl>, var1_upper <dbl>,
#> #   var2_upper <dbl>, var3_upper <dbl>, var4 <int>

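Second solution: more concise, using functional programming with purrr

Basically, this does in one block what was done in solution 1. It uses lists of arguments to call summarise_at with .vars and .funs. The result is a list of data.frames that you can combine with a join:
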
library(purrr)
lst(.vars = lst(vars(num_range("var", 1:3)), "var4"),
    .fun  = lst(summarise_fun, sum)) %>%
      pmap(~ data %>%
             group_by(group1, group2) %>%
             summarise_at(.x, .y)) %>%
      reduce(inner_join, by = c("group1", "group2"))
#> # A tibble: 4 x 12
#> # Groups:   group1 [?]
#>   group1 group2 var1_mean var2_mean var3_mean var1_lower  var2_lower
#>   <fctr> <fctr>     <dbl>     <dbl>     <dbl>      <dbl>       <dbl>
#> 1      A      A       6.0  5.500000  7.500000 -44.824819  -0.8531024
#> 2      A      B       6.5  6.500000  6.500000 -25.265512 -37.9717166
#> 3      B      A       4.0  5.666667  5.666667  -4.956686  -2.3187188
#> 4      B      B       6.0  4.666667  3.333333   3.515862  -5.3728564
#> # ... with 5 more variables: var3_lower <dbl>, var1_upper <dbl>,
#> #   var2_upper <dbl>, var3_upper <dbl>, var4 <int>
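
With newer package versions, the layout asked for in the question can also be sketched without funs or a join. This is a minimal sketch that assumes dplyr 1.0.0 or later (for across and for data-frame results in summarise) and tidyr 1.0.0 or later (for unpack), both of which appeared after these answers were written. The helper ci_row is a hypothetical stand-in for Rmisc::CI that returns a one-row tibble, and the small data set is made up so that the example is deterministic.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Rmisc::CI (an assumption, not the package code):
# a t-based interval returned as a one-row tibble instead of a named vector.
ci_row <- function(x, ci = 0.95) {
  m <- mean(x)
  err <- qt((1 + ci) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
  tibble(lower = m - err, mean = m, upper = m + err)
}

# Made-up, deterministic data: 4 groups with 3 observations each
data <- data.frame(
  group1 = rep(c("A", "B"), each = 6),
  group2 = rep(c("x", "y"), times = 6),
  var1 = 1:12,
  var2 = (1:12)^2,
  var3 = 12:1,
  var4 = rep(1, 12)
)

result <- data %>%
  group_by(group1, group2) %>%
  # each of var1..var3 becomes a data-frame column holding lower/mean/upper
  summarise(across(var1:var3, ci_row), var4 = sum(var4), .groups = "drop") %>%
  # flatten the data-frame columns into var1.lower, var1.mean, ...
  unpack(c(var1, var2, var3), names_sep = ".")

names(result)
```

The design choice here is to let the summary function return a data frame and to do the flattening once at the end with unpack, so each variable is mentioned only once and the interval is computed only once per group.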