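Reshaping data from wide to long format with multiple repeated columns of different types

Time: 2020-04-26 18:06:11

Tags: r dplyr tidyr data-wrangling

The dataset describes multiple repeated measurements for several clusters, with each measurement-cluster pair stored in its own column. I would like to reshape the data into long format so that a single column identifies the cluster while each measurement stays in its own column.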

library(stringr)  # provides the `fruit` character vector used below

# Current format
df_wider <- data.frame(
  id = 1:5,
  fruit_1 = sample(fruit, size = 5),
  date_1 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_1 = sample(1:100, 5),
  fruit_2 = sample(fruit, size = 5),
  date_2 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_2 = sample(1:100, 5),
  fruit_3 = sample(fruit, size = 5),
  date_3 = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 5),
  number_3 = sample(1:100, 5)
)

# Desired format
df_longer <- data.frame(
  id = rep(1:5, each = 3),
  cluster = rep(1:3, 5),
  fruit = sample(fruit, size = 15),
  date = sample(seq(as.Date('2020/01/01'), as.Date('2020/05/01'), by="day"), 15),
  number = sample(1:100, 15)
)

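The actual dataset contains up to 25 clusters, each with more than 100 measurements. I tried iterating over each measurement with tidyr::gather() / tidyr::pivot_longer(), but the intermediate data frame multiplies in size at every step. Because the values belong to different classes, I could not do this in a single tidyr::pivot_longer() step, and I cannot think of a way to vectorize it.

2 answers:

Answer 0: (score: 1)

You can do it like this: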

library(tidyr)
library(dplyr)

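# ".value" keeps each measurement in its own column; the suffix becomes `cluster`.
# \\d+ (rather than \\d) also matches multi-digit suffixes such as _10 or _25.
df_wider %>% pivot_longer(-id, 
                          names_pattern = "(.*)_(\\d+)", 
                          names_to = c(".value", "cluster"))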

# A tibble: 15 x 5
      id cluster fruit        date       number
   <int> <chr>   <fct>        <date>      <int>
 1     1 1       olive        2020-04-21     50
 2     1 2       elderberry   2020-02-23     59
 3     1 3       cherimoya    2020-03-07      9
 4     2 1       jujube       2020-03-22     88
 5     2 2       mandarine    2020-03-06     45
 6     2 3       grape        2020-04-23     78
 7     3 1       nut          2020-01-26     53
 8     3 2       cantaloupe   2020-01-27     70
 9     3 3       durian       2020-02-15     39
10     4 1       chili pepper 2020-03-17     60
11     4 2       raisin       2020-04-14     20
12     4 3       cloudberry   2020-03-11      4
13     5 1       honeydew     2020-01-04     81
14     5 2       lime         2020-03-23     53
15     5 3       ugli fruit   2020-01-13     26
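
In the output above, cluster comes back as a character column. As a minimal sketch, assuming tidyr 1.1.0 or later (where pivot_longer() accepts names_transform), the suffix can be converted to an integer in the same call. The data frame `toy` and its values are made up for illustration and include a two-digit suffix to show why the pattern captures one or more digits:

```r
library(tidyr)

# Hypothetical toy data (not from the question); cluster 10 has a two-digit suffix
toy <- data.frame(
  id = 1:2,
  fruit_1 = c("apple", "pear"),
  number_1 = c(10L, 20L),
  fruit_10 = c("kiwi", "plum"),
  number_10 = c(30L, 40L),
  stringsAsFactors = FALSE
)

toy_long <- pivot_longer(
  toy,
  cols = -id,
  # first group -> output column name (.value), second group -> cluster
  names_pattern = "(.*)_(\\d+)",
  names_to = c(".value", "cluster"),
  # convert the captured suffix from character to integer
  names_transform = list(cluster = as.integer)
)

print(toy_long)
```

Because each measurement keeps its own output column, fruit stays character and number stays integer, so no common type has to be found across measurements.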

Answer 1: (score: 1)

We can use melt from data.table:

library(data.table)
melt(setDT(df_wider), measure = patterns('^fruit', '^date', '^number' ), 
      value.name = c('fruit', 'date', 'number'), variable.name = 'cluster')
#    id cluster        fruit       date number
# 1:  1       1         date 2020-04-16     17
# 2:  2       1       quince 2020-01-27      7
# 3:  3       1      coconut 2020-04-19     33
# 4:  4       1  pomegranate 2020-02-27     55
# 5:  5       1    persimmon 2020-02-20     62
# 6:  1       2   kiwi fruit 2020-01-14    100
# 7:  2       2    cranberry 2020-03-15     97
# 8:  3       2     cucumber 2020-03-16      5
# 9:  4       2    persimmon 2020-03-06     81
#10:  5       2         date 2020-04-17     30
#11:  1       3      apricot 2020-04-13     86
#12:  2       3       banana 2020-04-17     42
#13:  3       3     bilberry 2020-02-23     88
#14:  4       3 blackcurrant 2020-02-25     10
#15:  5       3       raisin 2020-02-09     87
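
Two details of the melt() call may matter in practice: setDT() converts df_wider to a data.table by reference, and the cluster column produced with patterns() is a factor that records each column's position within its pattern group, not the suffix parsed from the name. The following is a rough sketch on made-up data (the object `toy` is hypothetical) that works on a copy and converts cluster to an integer:

```r
library(data.table)

# Hypothetical toy data (not from the question)
toy <- data.frame(
  id = 1:2,
  fruit_1 = c("apple", "pear"),
  number_1 = c(10L, 20L),
  fruit_2 = c("kiwi", "plum"),
  number_2 = c(30L, 40L),
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so `toy` itself stays a plain data.frame
toy_long <- melt(
  as.data.table(toy),
  id.vars = "id",
  measure.vars = patterns("^fruit", "^number"),
  value.name = c("fruit", "number"),
  variable.name = "cluster"
)

# cluster is the position within each pattern group; turn it into an integer
toy_long[, cluster := as.integer(as.character(cluster))]

# order rows by id, then cluster, to match the layout asked for in the question
setorder(toy_long, id, cluster)

print(toy_long)
```

Since cluster reflects column position, this relies on the wide columns appearing in the same cluster order within every measurement group.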