Merge 2 datasets in long format based on conditions

Time: 2020-04-11 11:51:03

Tags: r dataframe

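I have 2 data frames that I want to merge. The datasets differ in the number of observations and in the way they were collected. In df1, recordings were made on 2 different days. Each record has an index, id1 (the person's identification number), and id2, which numbers the day on which the recording was made (the days must be different). There is also a Day variable that records the day of the week on which the recording was made.

In df2, observations are recorded based only on index and id1, the person's identification number. There is only one observation per person. Here, too, there is a Day variable, which records when the recording started.

I want to identify the observations from df2 that were recorded on the same days as those in df1.

I tried creating a newindex (grouping index and id1) to go long and merge based on the day.

df1: Day indicates when the observation was made (e.g., for index 12, id1 shows only 1 person and id2 shows 2 days: Wednesday for id2 = 1 and Sunday for id2 = 2)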

    index id1 id2  Day         obs1 obs2 obs3
     12    1   1   Wednesday    1    11   12
     12    1   2   Sunday       2     0    0
    123    1   1   Tuesday      1     0    1
    123    1   2   Saturday     3     0    3
    123    2   1   Monday       2     2    4
    123    2   2   Saturday     1     0    8

df2: here the Day variable indicates the day on which observation started (e.g., id 12 day2 and id 123 day1)

    index id1  Day      day1 day2 day3 day4 day5 day6 day7
     12    1   Tuesday    2    1    2    1    1    3    1
    123    1   Friday     0    3    0    3    3    0    3

Result:

    index id1 id2  obs1 obs2 obs3
     12    1   1    1    11   12
     12    1   2    2     0    0
    123    1   2    3     0    3
    123    2   2    1     0    8

Sample data

df1:

structure(list(index = c(12, 12, 123, 123, 123, 123), id1 = c(1, 
1, 1, 1, 2, 2), id2 = c(1, 2, 1, 2, 1, 2), Day = structure(c(5L, 
3L, 4L, 2L, 1L, 2L), .Label = c("Monday", "Saturday", "Sunday", 
"Tuesday", "Wednesday"), class = "factor"), obs1 = c(1, 2, 1, 
3, 2, 1), obs2 = c(11, 0, 0, 0, 2, 0), obs3 = c(12, 0, 1, 3, 
4, 8)), class = "data.frame", row.names = c(NA, -6L))

df2:

structure(list(index = c(12, 123), id1 = c(1, 1), Day = structure(2:1, .Label = c("Friday", 
"Tuesday"), class = "factor"), day1 = c(2, 0), day2 = c(1, 3), 
    day3 = c(2, 0), day4 = c(1, 3), day5 = c(1, 3), day6 = c(3, 
    0), day7 = c(1, 3)), class = "data.frame", row.names = c(NA, 
-2L))

2 answers:

Answer 0 (score: 1)

We can get df2 in long format, group_by index, keep the rows that occur after the observation, and merge the result with df1 based on index and Day.

    library(dplyr)

    weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

    df2 %>%
      mutate_at(vars(matches('day\\d+')), as.numeric) %>%
      tidyr::pivot_longer(cols = matches('day\\d+')) %>%
      group_by(index) %>%
      filter(row_number() >= match(Day, weekday)[1L]) %>%
      summarise(Day = match(Day, weekday)[1]) %>%
      inner_join(df1 %>% mutate(Day = match(Day, weekday)), by = 'index') %>%
      filter(Day.y >= Day.x)

    #  index Day.x   id1   id2 Day.y  obs1  obs2  obs3
    #  <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
    #1    12     2     1     1     3     1    11    12
    #2    12     2     1     2     7     2     0     0
    #3   123     5     1     2     6     3     0     3
    #4   123     5     2     2     6     1     0     8

You can then keep only the required columns.
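The last step, keeping only the required columns, might look like the sketch below. It is not the answerer's code: it assumes dplyr's select is a suitable way to drop the helper columns, rebuilds the sample data so it runs on its own, and skips the pivot step because only the start day of each index is needed for the comparison. The names starts, start_num and day_num are hypothetical.

```r
library(dplyr)

weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Sample data from the question. Day is stored as character here, and the
# day1..day7 columns of df2 are left out because this sketch does not use them.
df1 <- data.frame(
  index = c(12, 12, 123, 123, 123, 123),
  id1   = c(1, 1, 1, 1, 2, 2),
  id2   = c(1, 2, 1, 2, 1, 2),
  Day   = c("Wednesday", "Sunday", "Tuesday", "Saturday", "Monday", "Saturday"),
  obs1  = c(1, 2, 1, 3, 2, 1),
  obs2  = c(11, 0, 0, 0, 2, 0),
  obs3  = c(12, 0, 1, 3, 4, 8)
)
df2 <- data.frame(
  index = c(12, 123),
  id1   = c(1, 1),
  Day   = c("Tuesday", "Friday")
)

# Start day of each index as a weekday number (hypothetical column start_num)
starts <- df2 %>%
  mutate(start_num = match(Day, weekday)) %>%
  select(index, start_num)

# Keep df1 rows recorded on or after the start day, then drop helper columns
result <- df1 %>%
  mutate(day_num = match(Day, weekday)) %>%
  inner_join(starts, by = "index") %>%
  filter(day_num >= start_num) %>%
  select(index, id1, id2, obs1, obs2, obs3)

print(result)
```

On the sample data this keeps the same four rows as the expected result in the question.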

Answer 1 (score: 1)

An option with melt from data.table:

    library(data.table)

    weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday")

If the Day columns in the datasets are haven labelled, we first convert them to factor with as_factor:

    library(haven)

    df1$Day <- as.character(as_factor(df1$Day))
    df2$Day <- as.character(as_factor(df2$Day))

    df1$Day <- match(df1$Day, weekday)
    dt2 <- melt(setDT(df2), measure = patterns('^day\\d+$'))[
      seq_len(.N) >= match(Day, weekday)[1L]
    ][, .(Day = match(Day, weekday)[1]), index]
    merge(setDT(df1), dt2, by = 'index')[Day.y < Day.x]

    #   index id1 id2 Day.x obs1 obs2 obs3 Day.y
    #1:    12   1   1     3    1   11   12     2
    #2:    12   1   2     7    2    0    0     2
    #3:   123   1   2     6    3    0    3     5
    #4:   123   2   2     6    1    0    8     5

Or, with tidyverse, it is better to return a list column in summarise and then unnest it (in case the length does not match the number of rows).
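The list-column approach could be sketched as follows. This is a minimal illustration rather than the answerer's code: it assumes tidyr's unnest is used to expand the list column, works on a reduced copy of df2 (the day1..day7 columns are omitted), and uses the hypothetical name starts for the result.

```r
library(dplyr)
library(tidyr)

weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")

# Reduced copy of df2 from the question (day1..day7 omitted for brevity)
df2 <- data.frame(
  index = c(12, 123),
  id1   = c(1, 1),
  Day   = c("Tuesday", "Friday")
)

starts <- df2 %>%
  group_by(index) %>%
  # Wrapping the value in list() gives exactly one list element per group,
  # regardless of how many values match() returns for that group
  summarise(Day = list(match(Day, weekday))) %>%
  # unnest() then expands each list element back into one row per value
  unnest(cols = Day)

print(starts)
```

Wrapping the summary in list() means summarise always receives a single element per group, so the number of values per group no longer matters until unnest expands them.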