Merge two data frames to fill in missing dates

Asked: 2016-01-24 01:39:16

Tags: r

I have two data.frames: (1) df1 has year, state, and yield; (2) df2 has a specific weight for each state, but at different year intervals.

I need to merge df1 with the variable w from df2, filling in the years that are missing from df2.

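To clarify: for the years in df1 from 1910 through 1919, use the variable w from df2 for each state at year 1910, and for 1920 and 1921 use the variable w for each state at year 1920. Because df2 lacks the intermediate years that appear in df1, the two do not match directly, so for any year that falls between two df2 dates I want to use the w from the earlier date. I hope that is clear.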

Sample data:

DF1

df1 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1911L, 
1911L, 1911L, 1911L, 1911L, 1919L, 1920L, 1920L, 1920L, 1920L, 
1920L, 1921L, 1921L, 1921L, 1921L, 1921L), state = c("colorado", 
"kansas", "new mexico", "oklahoma", "texas", "colorado", "kansas", 
"new mexico", "oklahoma", "texas", "texas", "colorado", "kansas", 
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico", 
"oklahoma", "texas"), acre_yield = c("15.5", "19", "15", "16", 
"22", "14", "14.5", "19.5", "7", "11", "23", "18.5", "26.2", 
"20", "26", "20", "12", "22.8", "19.5", "23", "18")), .Names = c("year", 
"state", "acre_yield"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 9L, 10L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 
59L, 60L), class = "data.frame")

DF2

df2 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1920L, 
1920L, 1920L, 1920L, 1920L), state = c("colorado", "kansas", 
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico", 
"oklahoma", "texas"), w = c(0.117773613611233, 0.332027298270738, 
0.0176064421992724, 0.492169193923849, 0.0404234519949076, 0.305574486110184, 
0.32107131682438, 0.0583601411807103, 0.264145354274187, 0.0508487016105393
)), .Names = c("year", "state", "w"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -10L))

Desired output:

   year      state acre_yield    w
1  1910   colorado       15.5    0.11777
2  1910     kansas         19    0.33202
3  1910 new mexico         15    0.01761
4  1910   oklahoma         16    0.49217
5  1910      texas         22    0.04042
6  1911   colorado         14    0.11777
7  1911     kansas       14.5    0.33202
8  1911 new mexico       19.5    0.01761
9  1911   oklahoma          7    0.49217
10 1911      texas         11    0.04042
50 1919      texas         23    0.04042
51 1920   colorado       18.5    0.30557
52 1920     kansas       26.2    0.32107
53 1920 new mexico         20    0.05836
54 1920   oklahoma         26    0.26414
55 1920      texas         20    0.05084
56 1921   colorado         12    0.30557
57 1921     kansas       22.8    0.32107
58 1921 new mexico       19.5    0.05836
59 1921   oklahoma         23    0.26414
60 1921      texas         18    0.05084

3 answers:

Answer 0 (score: 1)

One way, with dplyr:

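library(dplyr)

# Rows before 1920 take the 1910 weights (joined by state)
df3 <- df1 %>% filter(year < 1920) %>% 
               left_join(filter(df2, year == 1910) %>% select(-year), by = "state")
# Rows from 1920 on take the 1920 weights; then stack both parts and sort
df3 <- df1 %>% filter(year >= 1920) %>% 
               left_join(filter(df2, year == 1920) %>% select(-year), by = "state") %>% 
               bind_rows(df3) %>% 
               arrange(year, state)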

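This is split into two chains: one joins the data before 1920, the other joins the data from 1920 on, then the two results are bound together and sorted.

Update based on the comments:

Cut the years into 5-year increments and join the df2 values on those increments: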

df1$year_factor <- cut(df1$year, seq(1900, 1950, 5), right = FALSE)
df2$year_factor <- cut(df2$year, seq(1900, 1950, 5), right = FALSE)
df3 <- df1 %>% left_join(select(df2, -year), by = c("state", "year_factor")) %>% select(-year_factor)

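This is actually simpler, but it introduces (and then removes) a dummy variable, and cut can be a bit finicky; play with it as you like. It produces: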

   year      state acre_yield          w
1  1910   colorado       15.5 0.11777361
2  1910     kansas         19 0.33202730
3  1910 new mexico         15 0.01760644
4  1910   oklahoma         16 0.49216919
5  1910      texas         22 0.04042345
6  1911   colorado         14 0.11777361
7  1911     kansas       14.5 0.33202730
8  1911 new mexico       19.5 0.01760644
9  1911   oklahoma          7 0.49216919
10 1911      texas         11 0.04042345
11 1919      texas         23         NA
12 1920   colorado       18.5 0.30557449
13 1920     kansas       26.2 0.32107132
14 1920 new mexico         20 0.05836014
15 1920   oklahoma         26 0.26414535
16 1920      texas         20 0.05084870
17 1921   colorado         12 0.30557449
18 1921     kansas       22.8 0.32107132
19 1921 new mexico       19.5 0.05836014
20 1921   oklahoma         23 0.26414535
21 1921      texas         18 0.05084870

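Note the NA in the 1919 row: because df2 has no values between 1915 and 1919, there is nothing to join for that bin. To bin by decade instead, change the 5 in seq to 10, or set it to whatever you need.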
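To see why the bin width matters, here is a small base R sketch (not part of the original answer) that bins three of the years both ways. The `dig.lab = 4` argument is an addition made here only so the interval labels print as plain four-digit years; the vector name `years` is illustrative.

```r
# Bin a few years with 5-year and 10-year breaks; right = FALSE makes
# the intervals closed on the left, e.g. [1910, 1915).
years <- c(1910, 1919, 1920)
bin5  <- cut(years, seq(1900, 1950, 5),  right = FALSE, dig.lab = 4)
bin10 <- cut(years, seq(1900, 1950, 10), right = FALSE, dig.lab = 4)

# Side-by-side view of which bin each year lands in
print(data.frame(years, bin5, bin10))
```

With 5-year breaks, 1919 lands in a different bin from 1910, so it finds no df2 row to join. With 10-year breaks, 1910 and 1919 share a bin, while 1920 starts the next one.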

Answer 1 (score: 1)

Here is one way in base R, using apply:

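df1$w <- apply(df1, 1, function(row) {
    # apply() passes each row as a character vector, so convert year back to a number
    yr <- as.integer(row[["year"]])
    idx <- which(df2$state == row[["state"]] & df2$year <= yr)
    # of the candidate rows, keep the one with the latest year
    idx <- idx[which.max(df2$year[idx])]
    return(df2$w[idx])
})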
df1
#    year      state acre_yield          w
# 1  1910   colorado       15.5 0.11777361
# 2  1910     kansas         19 0.33202730
# 3  1910 new mexico         15 0.01760644
# 4  1910   oklahoma         16 0.49216919
# 5  1910      texas         22 0.04042345
# 6  1911   colorado         14 0.11777361
# 7  1911     kansas       14.5 0.33202730
# 8  1911 new mexico       19.5 0.01760644
# 9  1911   oklahoma          7 0.49216919
# 10 1911      texas         11 0.04042345
# 50 1919      texas         23 0.04042345
# 51 1920   colorado       18.5 0.30557449
# 52 1920     kansas       26.2 0.32107132
# 53 1920 new mexico         20 0.05836014
# 54 1920   oklahoma         26 0.26414535
# 55 1920      texas         20 0.05084870
# 56 1921   colorado         12 0.30557449
# 57 1921     kansas       22.8 0.32107132
# 58 1921 new mexico       19.5 0.05836014
# 59 1921   oklahoma         23 0.26414535
# 60 1921      texas         18 0.05084870

I can't promise this is the most efficient way, but it's the first thing that came to mind.
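If the row-by-row apply turns out to be slow, one possible vectorized variant in base R uses findInterval per state. This is only a sketch under the same assumption as above (each row should take the w of the latest df2 year at or before its own year); the function name `lookup_w` and the toy data frames `yields` and `weights` are hypothetical, not from the question.

```r
# For each row of `yields`, look up the weight from the latest year in
# `weights` (same state) that is <= the row's year; NA if there is none.
lookup_w <- function(yields, weights) {
  w <- rep(NA_real_, nrow(yields))
  for (s in unique(yields$state)) {
    ref <- weights[weights$state == s, ]
    if (nrow(ref) == 0) next            # state has no weights at all
    ref <- ref[order(ref$year), ]       # findInterval needs sorted breaks
    rows <- which(yields$state == s)
    pos <- findInterval(yields$year[rows], ref$year)
    pos[pos == 0] <- NA                 # year is before the first weight year
    w[rows] <- ref$w[pos]
  }
  w
}

# Toy data (illustrative values only)
weights <- data.frame(year = c(1920L, 1910L, 1910L),
                      state = c("texas", "texas", "kansas"),
                      w = c(0.05, 0.04, 0.33),
                      stringsAsFactors = FALSE)
yields <- data.frame(year = c(1910L, 1919L, 1921L, 1905L, 1915L),
                     state = c("texas", "texas", "texas", "texas", "kansas"),
                     stringsAsFactors = FALSE)
result <- lookup_w(yields, weights)
```

On the question's data the call would be `df1$w <- lookup_w(df1, df2)`.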

Answer 2 (score: 1)

Using a rolling join from data.table:

library(data.table)
dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)
dt1[, w := dt2[dt1, w, on = c("state", "year"), roll = Inf, rollends = TRUE]]

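where dt1 and dt2 are the data.table versions of df1 and df2, respectively.

dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE] extracts, for each row of dt1, the dt2$w value whose state and year match. If there is no exact match, the last matching value is carried forward instead. This is known as a last observation carried forward (LOCF) join.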
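A minimal toy example may make the rolling behaviour easier to see. It assumes the data.table package is installed; the tables `weights` and `yields` and their values are made up for illustration.

```r
library(data.table)

# Weights exist only for 1910 and 1920
weights <- data.table(state = c("texas", "texas"),
                      year  = c(1910L, 1920L),
                      w     = c(0.04, 0.05))
# Yield rows include years with no exact match in `weights`
yields <- data.table(state = "texas",
                     year  = c(1910L, 1919L, 1920L, 1921L))

# Exact match on state, rolling match on year (the last `on` column):
# each yields row takes w from the most recent weights year <= its own.
rolled <- weights[yields, w, on = c("state", "year"), roll = Inf, rollends = TRUE]
yields[, w := rolled]
print(yields)
```

Here 1919 picks up the 1910 weight and 1921 picks up the 1920 weight, which is the fill pattern the question asks for.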