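I have two data.frames: (1) df1, with year, state, and yield, and (2) df2, with a specific weight for each state, but at different year intervals.
I need to merge df1 with the variable w from df2, filling in the years that are missing from df2.
To clarify: for the years in df1 between 1910 and 1919, use the value of w in df2 for each state in 1910; for 1920 and 1921, use the value of w for each state in 1920. Because the years in df2 do not all match those in df1, I want to use the interval between two df2 dates to decide which w applies. I hope this is clear.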
Sample data:
DF1
df1 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1911L,
1911L, 1911L, 1911L, 1911L, 1919L, 1920L, 1920L, 1920L, 1920L,
1920L, 1921L, 1921L, 1921L, 1921L, 1921L), state = c("colorado",
"kansas", "new mexico", "oklahoma", "texas", "colorado", "kansas",
"new mexico", "oklahoma", "texas", "texas", "colorado", "kansas",
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico",
"oklahoma", "texas"), acre_yield = c("15.5", "19", "15", "16",
"22", "14", "14.5", "19.5", "7", "11", "23", "18.5", "26.2",
"20", "26", "20", "12", "22.8", "19.5", "23", "18")), .Names = c("year",
"state", "acre_yield"), row.names = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L,
59L, 60L), class = "data.frame")
DF2
df2 <- structure(list(year = c(1910L, 1910L, 1910L, 1910L, 1910L, 1920L,
1920L, 1920L, 1920L, 1920L), state = c("colorado", "kansas",
"new mexico", "oklahoma", "texas", "colorado", "kansas", "new mexico",
"oklahoma", "texas"), w = c(0.117773613611233, 0.332027298270738,
0.0176064421992724, 0.492169193923849, 0.0404234519949076, 0.305574486110184,
0.32107131682438, 0.0583601411807103, 0.264145354274187, 0.0508487016105393
)), .Names = c("year", "state", "w"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -10L))
Desired output:
year state acre_yield w
1 1910 colorado 15.5 0.11777
2 1910 kansas 19 0.33202
3 1910 new mexico 15 0.01761
4 1910 oklahoma 16 0.49217
5 1910 texas 22 0.04042
6 1911 colorado 14 0.11777
7 1911 kansas 14.5 0.33202
8 1911 new mexico 19.5 0.01761
9 1911 oklahoma 7 0.49217
10 1911 texas 11 0.04042
50 1919 texas 23 0.04042
51 1920 colorado 18.5 0.30557
52 1920 kansas 26.2 0.32107
53 1920 new mexico 20 0.05836
54 1920 oklahoma 26 0.26414
55 1920 texas 20 0.05084
56 1921 colorado 12 0.30557
57 1921 kansas 22.8 0.32107
58 1921 new mexico 19.5 0.05836
59 1921 oklahoma 23 0.26414
60 1921 texas 18 0.05084
Answer 0 (score: 1)
One way, with dplyr:
library(dplyr)
df3 <- df1 %>% filter(year < 1920) %>%
left_join(filter(df2, year == 1910) %>% select(-year))
df3 <- df1 %>% filter(year >= 1920) %>%
left_join(filter(df2, year == 1920) %>% select(-year)) %>%
bind_rows(df3) %>%
arrange(year, state)
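This is split into two chains: one joins the pre-1920 rows to the 1910 weights, the other joins the rows from 1920 onward to the 1920 weights, then binds the two results together and sorts them.
Another way: cut the years into 5-year bins and join the df2 values on those bins: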
df1$year_factor <- cut(df1$year, seq(1900, 1950, 5), right = FALSE)
df2$year_factor <- cut(df2$year, seq(1900, 1950, 5), right = FALSE)
df3 <- df1 %>% left_join(select(df2, -year)) %>% select(-year_factor)
This is actually simpler, but it introduces (and then drops) a dummy variable, and cut can be a bit finicky; adjust it as you see fit. It produces:
year state acre_yield w
1 1910 colorado 15.5 0.11777361
2 1910 kansas 19 0.33202730
3 1910 new mexico 15 0.01760644
4 1910 oklahoma 16 0.49216919
5 1910 texas 22 0.04042345
6 1911 colorado 14 0.11777361
7 1911 kansas 14.5 0.33202730
8 1911 new mexico 19.5 0.01760644
9 1911 oklahoma 7 0.49216919
10 1911 texas 11 0.04042345
11 1919 texas 23 NA
12 1920 colorado 18.5 0.30557449
13 1920 kansas 26.2 0.32107132
14 1920 new mexico 20 0.05836014
15 1920 oklahoma 26 0.26414535
16 1920 texas 20 0.05084870
17 1921 colorado 12 0.30557449
18 1921 kansas 22.8 0.32107132
19 1921 new mexico 19.5 0.05836014
20 1921 oklahoma 23 0.26414535
21 1921 texas 18 0.05084870
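Note the NA in the 1919 row: df2 has no values between 1915 and 1919, so there is nothing to join for that bin. To bin by decade instead, change the 5 in seq to 10, or set it however you need.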
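The NA for 1919 can be avoided by matching each df1 year to the most recent df2 year at or before it, rather than to a fixed-width bin. The following is a minimal sketch of that idea, not part of the original answer. It assumes every year in df1 is no earlier than the first year in df2. The helper names w_years and w_year are introduced here for illustration, and the data are only the texas rows from the sample above.

```r
library(dplyr)

# Stand-ins for df1 and df2: the texas rows from the sample data
df1 <- data.frame(year = c(1910L, 1911L, 1919L, 1920L, 1921L),
                  state = "texas",
                  acre_yield = c("22", "11", "23", "20", "18"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(year = c(1910L, 1920L),
                  state = "texas",
                  w = c(0.0404234519949076, 0.0508487016105393),
                  stringsAsFactors = FALSE)

# Years for which weights exist, sorted ascending as findInterval() requires
w_years <- sort(unique(df2$year))

df3 <- df1 %>%
  # For each row, the latest weight year that is <= the row's year
  mutate(w_year = w_years[findInterval(year, w_years)]) %>%
  # Join on state and that weight year, then drop the helper column
  left_join(rename(df2, w_year = year), by = c("state", "w_year")) %>%
  select(-w_year)
```

Because the breakpoints come from df2 itself, no bin width has to be chosen, and a year such as 1919 picks up the 1910 weight.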
Answer 1 (score: 1)
Here is one way to do it in base R using apply:
df1$w <- apply(df1, 1, function(row) {
idx <- which(df2$state == row['state'] & df2$year <= row['year'])
idx <- max(idx) # want the max year that matches
return(df2$w[idx])
})
df1
# year state acre_yield w
# 1 1910 colorado 15.5 0.11777361
# 2 1910 kansas 19 0.33202730
# 3 1910 new mexico 15 0.01760644
# 4 1910 oklahoma 16 0.49216919
# 5 1910 texas 22 0.04042345
# 6 1911 colorado 14 0.11777361
# 7 1911 kansas 14.5 0.33202730
# 8 1911 new mexico 19.5 0.01760644
# 9 1911 oklahoma 7 0.49216919
# 10 1911 texas 11 0.04042345
# 50 1919 texas 23 0.04042345
# 51 1920 colorado 18.5 0.30557449
# 52 1920 kansas 26.2 0.32107132
# 53 1920 new mexico 20 0.05836014
# 54 1920 oklahoma 26 0.26414535
# 55 1920 texas 20 0.05084870
# 56 1921 colorado 12 0.30557449
# 57 1921 kansas 22.8 0.32107132
# 58 1921 new mexico 19.5 0.05836014
# 59 1921 oklahoma 23 0.26414535
# 60 1921 texas 18 0.05084870
I can't promise this is the most efficient way, but it's the first thing that came to mind.
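Two details of the apply version are worth knowing. apply() converts each row to character, so the year test is a string comparison (harmless for four-digit years), and max(idx) picks the highest row index, which is the latest year only if df2 is sorted by year. A hedged variant that sidesteps both by iterating over the columns directly with mapply might look like the sketch below. The argument names st and yr and the variable cand are illustrative only, and the data are just the texas rows from the sample above.

```r
# Stand-ins for df1 and df2: the texas rows from the sample data
df1 <- data.frame(year = c(1910L, 1911L, 1919L, 1920L, 1921L),
                  state = "texas",
                  acre_yield = c("22", "11", "23", "20", "18"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(year = c(1920L, 1910L),   # deliberately unsorted
                  state = "texas",
                  w = c(0.0508487016105393, 0.0404234519949076),
                  stringsAsFactors = FALSE)

df1$w <- mapply(function(st, yr) {
  # Candidate weights: same state, weight year not after the row's year
  cand <- df2[df2$state == st & df2$year <= yr, ]
  # No earlier weight available: return NA instead of failing
  if (nrow(cand) == 0) return(NA_real_)
  # Take the weight from the latest qualifying year
  cand$w[which.max(cand$year)]
}, df1$state, df1$year, USE.NAMES = FALSE)
```

Years stay integers throughout, and the result does not depend on the row order of df2.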
Answer 2 (score: 1)
Using a rolling join from data.table:
require(data.table)
dt1[, w := dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE]]
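where dt1 and dt2 are the data.tables corresponding to df1 and df2, respectively.
dt2[dt1, w, on=c("state", "year"), roll=Inf, rollends=TRUE] extracts dt2$w for each row of dt1, matching on the columns state and year. If there is no exact match on year, the last matching value is retrieved. This is known as a last observation carried forward (LOCF) join.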