Merging some rows into a new dataset

Asked: 2019-08-22 14:37:08

Tags: r dataframe

This is not a long-to-wide reshaping question, so please do not mark it as a duplicate.

Suppose I have:

 HouseholdID.  PersonID.   time.     dur.    age
      1            1         3        4       19
      1            2         3        4       29
      1            3         5        5       30
      1            1         5        5       18
      2            1         21       30      18
      2            2         21       30      30

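Within each household, some people share the same time and dur. I only want to merge the rows that have the same HouseholdID, time, and dur.

Desired output: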

   HouseholdID.  PersonID.   time.   dur.   age.  HouseholdID.  PersonID.  time.    dur.    age
       1            1         3        4      19     1            2         3        4       29
       1            3         5        5      30     1            1         5        5       18
       2            1         21       30     18     2            2         21       30      30

1 answer:

Answer 0 (score: 5)

One option is dcast from data.table, which can take multiple value.var columns

library(data.table)
dcast(setDT(df1), HouseholdID. ~ rowid(HouseholdID.), 
      value.var = c("PersonID.", "time.", "dur.", "age"), sep="")
#   HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2 age1 age2
#1:            1          1          2      3      3     4     4   19   29
#2:            2          1          2     21     21    30    30   18   30
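
To see why this formula works, it may help to look at what rowid() returns on its own. The snippet below is only a small sketch: it rebuilds df1 with the same values as in the Data section so that it runs standalone, and rn is just an illustrative name. The counter restarts for every household, and dcast appends it to each value.var name to form the wide column names.

```r
library(data.table)

# Same values as df1 in the Data section, rebuilt so this runs on its own
df1 <- data.frame(
  HouseholdID. = c(1L, 1L, 2L, 2L),
  PersonID.    = c(1L, 2L, 1L, 2L),
  time.        = c(3L, 3L, 21L, 21L),
  dur.         = c(4L, 4L, 30L, 30L),
  age          = c(19L, 29L, 18L, 30L)
)

# rowid() counts the rows within each group of identical HouseholdID. values
rn <- rowid(df1$HouseholdID.)
print(rn)
```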

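Or an option with pivot_wider from the development version of tidyr (pivot_wider is part of the released package from tidyr 1.0.0 onward)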

library(tidyr) # ‘0.8.3.9000’
library(dplyr)
df1 %>%
  group_by(HouseholdID.) %>% 
  mutate(rn = row_number()) %>% 
  pivot_wider(id_cols = HouseholdID., names_from = rn, 
              values_from = c(PersonID., time., dur., age), names_sep = "")
# A tibble: 2 x 9
#  HouseholdID. PersonID.1 PersonID.2 time.1 time.2 dur.1 dur.2  age1  age2
#         <int>      <int>      <int>  <int>  <int> <int> <int> <int> <int>
#1            1          1          2      3      3     4     4    19    29
#2            2          1          2     21     21    30    30    18    30

Update

With the new dataset, extend the id columns to include 'time.' and 'dur.'.

dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.), 
          value.var = c("PersonID.", "age"), sep="")

If the duplicated 'time.' and 'dur.' columns are needed (though it is not clear why they would be)

dcast(setDT(df2), HouseholdID. + time. + dur. ~ rowid(HouseholdID., time., dur.), 
       value.var = c("PersonID.", "time.", "dur.", "age"), sep="")[, 
            c('time.', 'dur.') := NULL][]
#   HouseholdID. PersonID.1 PersonID.2 time..11 time..12 dur..11 dur..12 age1 age2
#1:            1          1          2        3        3       4       4   19   29
#2:            1          3          1        5        5       5       5   30   18
#3:            2          1          2       21       21      30      30   18   30

Or using the tidyverse

df2 %>% 
    group_by(HouseholdID., time., dur.) %>% 
    mutate(rn = row_number()) %>% 
    pivot_wider(id_cols= c(HouseholdID., time., dur.), names_from = rn, 
               values_from = c(PersonID., age), names_sep = "")
# A tibble: 3 x 7
#  HouseholdID. time.  dur. PersonID.1 PersonID.2  age1  age2
#         <int> <int> <int>      <int>      <int> <int> <int>
#1            1     3     4          1          2    19    29
#2            1     5     5          3          1    30    18
#3            2    21    30          1          2    18    30
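
If neither package is available, the same reshaping can probably be done in base R as well. This is a sketch rather than part of the original answer: it uses ave() to build the within-group counter and reshape() to go wide, df2 is rebuilt with the values from the Data section, and the names rn and wide are only illustrative.

```r
# Same values as df2 in the Data section, rebuilt so this runs on its own
df2 <- data.frame(
  HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L),
  PersonID.    = c(1L, 2L, 3L, 1L, 1L, 2L),
  time.        = c(3L, 3L, 5L, 5L, 21L, 21L),
  dur.         = c(4L, 4L, 5L, 5L, 30L, 30L),
  age          = c(19L, 29L, 30L, 18L, 18L, 30L)
)

# Counter within each HouseholdID. / time. / dur. combination
df2$rn <- ave(seq_len(nrow(df2)),
              df2$HouseholdID., df2$time., df2$dur.,
              FUN = seq_along)

# One row per id combination; PersonID. and age are spread by the counter
wide <- reshape(df2,
                idvar     = c("HouseholdID.", "time.", "dur."),
                timevar   = "rn",
                direction = "wide",
                sep       = "")
print(wide)
```

Here the counter plays the same role as rowid() or row_number() in the package-based versions, so the resulting columns should again be PersonID.1, PersonID.2, age1 and age2, one row per household, time. and dur. combination.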

Note: duplicate column names are not recommended, because they can make it confusing to identify columns.

Data
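
A small illustration of that confusion, as a sketch with made-up one-row data (dup and fixed are hypothetical names): when two columns share a name, extracting by name only ever reaches the first one, and make.unique() is one way to get distinct names back.

```r
# A data frame with two columns both called "age" (check.names = FALSE keeps the duplicate)
dup <- data.frame(age = 19L, age = 29L, check.names = FALSE)
print(names(dup))

# Extracting by name silently returns only the first matching column
first_age <- dup$age
print(first_age)

# make.unique() appends a suffix so that every column can be addressed again
fixed <- dup
names(fixed) <- make.unique(names(dup))
print(names(fixed))
```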

df1 <- structure(list(HouseholdID. = c(1L, 1L, 2L, 2L), PersonID. = c(1L, 
2L, 1L, 2L), time. = c(3L, 3L, 21L, 21L), dur. = c(4L, 4L, 30L, 
30L), age = c(19L, 29L, 18L, 30L)), class = "data.frame", row.names = c(NA, 
-4L))


df2 <- structure(list(HouseholdID. = c(1L, 1L, 1L, 1L, 2L, 2L), PersonID. = c(1L, 
2L, 3L, 1L, 1L, 2L), time. = c(3L, 3L, 5L, 5L, 21L, 21L), dur. = c(4L, 
4L, 5L, 5L, 30L, 30L), age = c(19L, 29L, 30L, 18L, 18L, 30L)), 
 class = "data.frame", row.names = c(NA, 
-6L))