dplyr | tidyverse: combine two sets of key-value pairs into a single key-value pair (long format)

Time: 2018-05-28 07:52:52

Tags: r merge dplyr key-value tidyverse

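What is the canonical dplyr/tidyverse way to merge two sets of key-value pairs?

The first key-value pair is parameter - coeft.

The second key-value pair is param - value. The wrinkle is that the values in this second pair are duplicated across rows.

I would like to merge them into a single key-value pair.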

dat <- tidyr::crossing(sim=c(1:5), 
                parameter=c('mu','sigma'), 
                param=c('sd','sd')
                ) %>%
        dplyr::mutate(coeft=rnorm(n=10)) %>%
        dplyr::mutate(value=sort(rep(rnorm(n=5),2)))
> dat
# A tibble: 10 x 5
  sim parameter param  coeft   value
  <int> <chr>     <chr>  <dbl>   <dbl>
1     1 mu        sd    -1.91  -0.601 
2     1 sigma     sd    -0.967 -0.601 
3     2 mu        sd    -1.95   0.0645
4     2 sigma     sd     0.676  0.0645
5     3 mu        sd    -0.891  0.673 
6     3 sigma     sd    -0.328  0.673 
7     4 mu        sd    -2.30   1.08  
8     4 sigma     sd     0.679  1.08  
9     5 mu        sd    -0.598  1.99  
10     5 sigma     sd    -0.339  1.99 

Desired structure:

# A tibble: 15 x 3
  sim parameter   coeft
  <int> <chr>     <dbl>
1     1 mu       -1.91
2     1 sigma    -0.967  
3     1 sd       -0.601
4     2 mu       -1.95
5     2 sigma    0.676  
6     2 sd       0.0645
...

2 answers:

Answer 0 (score: 3)

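Here is an approach with dplyr (run with dplyr v0.7.4 on Windows 7, 64-bit R). Note that spread() and gather() come from tidyr, so both packages need to be loaded: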

dat %>%
  spread(parameter, coeft) %>% #convert to wide format
  rename(sd = value) %>% #change the name of a column
  gather(parameter, coeft, c(4,5,3)) %>% #convert three disjointly located columns to long format, note the order of columns
  # gather(parameter, coeft, sd:sigma) %>% #convert three contiguously located columns to long format
  arrange(sim) %>% #order of rows
  select(-param) 

This gives a warning with some versions of dplyr (0.7.4) but not with others (I will post a version without the error tomorrow, once I have checked).

warning: Warning message: In if (!is.finite(x)) return(FALSE) : the condition has length > 1 and only the first element will be used

In that case, the following runs without the warning, because the columns are selected by name:

dat %>%
  spread(parameter, coeft) %>% 
  dplyr::rename(sd = value) %>% 
  gather(parameter, coeft, "mu", "sigma", "sd") %>% 
  arrange(sim) %>% #order of rows
  select(-param) 

Also note that if you prefer to use column-exclusion notation, you need to drop the param column first.

dat %>%
  spread(parameter, coeft) %>% #convert to wide format
  rename(sd = value) %>% #change the name of a column
  select(-param) %>%
  gather(parameter, coeft, -sim) %>% #convert every column except sim to long format
  arrange(sim) #order of rows

#output
     sim parameter  coeft
  <int> <chr>      <dbl>
 1     1 mu        -0.626
 2     1 sigma      0.184
 3     1 sd        -2.21 
 4     2 mu        -0.836
 5     2 sigma      1.60 
 6     2 sd        -0.621
 7     3 mu         0.330
 8     3 sigma     -0.820
 9     3 sd         0.390
10     4 mu         0.487
11     4 sigma      0.738
12     4 sd         1.12 
13     5 mu         0.576
14     5 sigma     -0.305
15     5 sd         1.51 

Data:

set.seed(1)
dat <- tidyr::crossing(sim=c(1:5), 
                       parameter=c('mu','sigma'), 
                       param=c('sd','sd')
) %>%
  dplyr::mutate(coeft=rnorm(n=10)) %>%
  dplyr::mutate(value=sort(rep(rnorm(n=5),2)))
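
For readers on tidyr 1.0.0 or later, where spread() and gather() are superseded by pivot_wider() and pivot_longer(), the same wide-then-long reshape might look roughly like the sketch below. It is not part of the original answer. The data frame is rebuilt from the values printed in the question so that the block is self-contained, and the name `long` is only illustrative.

```r
library(dplyr)
library(tidyr)

# Data rebuilt from the table printed in the question (assumption: these
# rounded values stand in for the original random draws).
dat <- data.frame(
  sim = rep(1:5, each = 2),
  parameter = rep(c("mu", "sigma"), times = 5),
  param = "sd",
  coeft = c(-1.91, -0.967, -1.95, 0.676, -0.891,
            -0.328, -2.3, 0.679, -0.598, -0.339),
  value = rep(c(-0.601, 0.0645, 0.673, 1.08, 1.99), each = 2),
  stringsAsFactors = FALSE
)

long <- dat %>%
  # one row per sim, with mu and sigma as separate columns
  pivot_wider(names_from = parameter, values_from = coeft) %>%
  # the duplicated value column becomes the "sd" measurement
  rename(sd = value) %>%
  select(-param) %>%
  # back to long format: three rows per sim
  pivot_longer(cols = c(mu, sigma, sd),
               names_to = "parameter", values_to = "coeft")

print(long)
```

Selecting the columns by name in pivot_longer() plays the same role as the quoted column names in the gather() call above: it avoids depending on column positions.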

Answer 1 (score: 0)

If we need to reshape multiple sets of columns into 'long' format, then melt from data.table is an option:

library(data.table)
dt <- unique(melt(setDT(dat), measure = list(2:3, 4:5),
       value.name = c('parameter', 'coeft')))[, variable := NULL][order(sim)]
dt
#    sim parameter   coeft
# 1:   1        mu -1.9100
# 2:   1     sigma -0.9670
# 3:   1        sd -0.6010
# 4:   2        mu -1.9500
# 5:   2     sigma  0.6760
# 6:   2        sd  0.0645
# 7:   3        mu -0.8910
# 8:   3     sigma -0.3280
# 9:   3        sd  0.6730
#10:   4        mu -2.3000
#11:   4     sigma  0.6790
#12:   4        sd  1.0800
#13:   5        mu -0.5980
#14:   5     sigma -0.3390
#15:   5        sd  1.9900
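
To see why the unique() call matters, it can help to look at what melt() returns for a single sim before any clean-up. The sketch below is an illustration under assumptions, not part of the original answer: `one_sim` is a hypothetical two-row table holding the sim 1 values from the question, and the measure columns are given by name rather than by position.

```r
library(data.table)

# Hypothetical two-row table: the sim 1 rows from the question.
one_sim <- data.table(
  sim = c(1L, 1L),
  parameter = c("mu", "sigma"),
  param = "sd",
  coeft = c(-1.91, -0.967),
  value = -0.601
)

# Each list element is one group of measure columns; the k-th columns of
# every group are stacked together, and `variable` records k.
molten <- melt(
  one_sim,
  id.vars = "sim",
  measure.vars = list(c("parameter", "param"), c("coeft", "value")),
  value.name = c("parameter", "coeft")
)

print(molten)          # four rows: mu, sigma, and the sd row twice
print(unique(molten))  # three rows: the duplicated sd row is collapsed
```

Because param and value repeat on every row of a sim, the second group contributes the same sd row once per original row, which is what unique() removes before `variable` is dropped.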

Data

dat <- structure(list(sim = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L
), parameter = c("mu", "sigma", "mu", "sigma", "mu", "sigma", 
"mu", "sigma", "mu", "sigma"), param = c("sd", "sd", "sd", "sd", 
"sd", "sd", "sd", "sd", "sd", "sd"), coeft = c(-1.91, -0.967, 
-1.95, 0.676, -0.891, -0.328, -2.3, 0.679, -0.598, -0.339), value = c(-0.601, 
-0.601, 0.0645, 0.0645, 0.673, 0.673, 1.08, 1.08, 1.99, 1.99)), 
.Names = c("sim", 
"parameter", "param", "coeft", "value"),
   class = "data.frame", row.names = c(NA, 
-10L))
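
The desired result can also be read as simply stacking the two key-value pairs on top of each other. The following is a minimal sketch of that reading using dplyr only. It does not come from either answer, and the names `first_pair`, `second_pair` and `stacked` are illustrative.

```r
library(dplyr)

# Same values as the data block above, written more compactly.
dat <- data.frame(
  sim = rep(1:5, each = 2),
  parameter = rep(c("mu", "sigma"), times = 5),
  param = "sd",
  coeft = c(-1.91, -0.967, -1.95, 0.676, -0.891,
            -0.328, -2.3, 0.679, -0.598, -0.339),
  value = rep(c(-0.601, 0.0645, 0.673, 1.08, 1.99), each = 2),
  stringsAsFactors = FALSE
)

# First key-value pair: already in the target shape.
first_pair <- dat %>% select(sim, parameter, coeft)

# Second key-value pair: drop the duplicated rows, then rename the
# columns so they line up with the first pair.
second_pair <- dat %>%
  distinct(sim, param, value) %>%
  rename(parameter = param, coeft = value)

stacked <- bind_rows(first_pair, second_pair) %>%
  arrange(sim)

print(stacked)
```

This avoids the round trip through wide format, at the cost of handling each key-value pair explicitly.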