I have this data, df.1:
month a b c
1 0 0.000000000 0.000000000
2 0 0.000000000 0.001503194
3 0 0.000000000 0.000000000
4 0 0.000000000 0.000000000
5 0 0.000000000 0.000000000
6 0 0.000000000 0.000000000
7 0 0.000000000 0.000000000
8 0 0.000000000 0.000000000
9 0 0.000000000 0.000000000
10 0 0.000000000 0.000000000
11 NA NA NA
12 NA NA NA
1 0 0.000000000 0.000000000
2 0 0.001537279 0.006917756
3 0 0.000000000 0.003669725
4 0 0.000000000 0.000000000
5 0 0.000000000 0.000000000
6 0 0.000000000 0.000000000
7 0 0.000000000 0.000000000
8 0 0.000000000 0.000000000
9 0 0.000000000 0.000000000
10 0 0.000000000 0.000000000
11 0 0.000000000 0.013513514
12 NA NA NA
and this data, df.2:
month a b c
1 0.03842077 0.002266291 0.000000000
2 0.01359501 0.001027937 0.000000000
3 0.08631519 0.008732519 0.001376147
4 0.26564710 0.083635347 0.019053692
5 0.34839088 0.152203121 0.021010075
6 0.31767367 0.152029019 0.029397773
7 0.31507761 0.110973916 0.023445471
8 0.29773872 0.096458381 0.026745770
9 0.31226976 0.109342562 0.023996392
10 0.23841220 0.081582743 0.021674228
11 0.04379016 0.003519300 0.000000000
12 0.02244389 0.002493766 0.000000000
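I would like to replace the NA values (and only the NAs) in df.1[,2:4] with the values in df.2[,2:4] when the index (month) in column 1 is the same. I tried this code:
res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3))
for (n in 1:12){
res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n]))
}
but the result is a large new matrix in which each NA value of df.1 is replaced by all of the values in df.2.
What can I do? (My actual data frames are much bigger.)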
Answer 0 (score: 1)
The first 12 rows of the data:
df.1 <- data.frame(
month = 1:12,
a = c(rep(0, 10), NA, NA),
b = c(rep(0, 10), NA, NA),
c = c(0, 0.001503194, rep(0, 8), NA, NA)
)
df.2 <- data.frame(
month = 1:12,
a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367,
0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389),
b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121,
0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743,
0.0035193, 0.002493766 ),
c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471,
0.02674577, 0.023996392, 0.021674228, 0, 0)
)
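Solution
This solution also handles rows in which only some of the columns are NA. As written, it compares row i of df.1 with row i of df.2, so both data frames must be in the same row order. On large data it may take a while to finish.
for (row in 1:nrow(df.1)) {
  # every column except the first one (month)
  for (col in names(df.1)[-1]) {
    # replace only NA cells, and only when the months in this row agree
    if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) {
      df.1[row, col] <- df.2[row, col]
    }
  }
}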
df.1
month a b c
1 1 0.00000000 0.000000000 0.000000000
2 2 0.00000000 0.000000000 0.001503194
3 3 0.00000000 0.000000000 0.000000000
4 4 0.00000000 0.000000000 0.000000000
5 5 0.00000000 0.000000000 0.000000000
6 6 0.00000000 0.000000000 0.000000000
7 7 0.00000000 0.000000000 0.000000000
8 8 0.00000000 0.000000000 0.000000000
9 9 0.00000000 0.000000000 0.000000000
10 10 0.00000000 0.000000000 0.000000000
11 11 0.04379016 0.003519300 0.000000000
12 12 0.02244389 0.002493766 0.000000000
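Explanation
With a double loop, we check every element in columns a through c. If the element is not NA, we move on to the next element. Otherwise, we check whether the month in the same row of df.2 is the same, and if that is TRUE, we replace the element with the corresponding value in df.2.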
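The loop above compares row i of df.1 with row i of df.2, so it relies on both data frames being in the same row order. Below is a minimal sketch of a variant that looks up the matching month with match() instead, which should also cover a df.1 that repeats the months, as in the question. The small data frames and the name m are made up for illustration and are not the original data.

```r
# Illustrative data (not the original): df.1 repeats the months, df.2 has each month once
df.1 <- data.frame(
  month = c(1, 2, 3, 1, 2, 3),
  a = c(0, NA, 0, NA, 0, NA),
  b = c(NA, NA, 0.5, 0, 0, NA)
)
df.2 <- data.frame(month = 1:3, a = c(10, 20, 30), b = c(100, 200, 300))

for (row in seq_len(nrow(df.1))) {
  # position of this row's month in df.2, or NA if df.2 has no such month
  m <- match(df.1[row, "month"], df.2$month)
  if (is.na(m)) next
  for (col in names(df.1)[-1]) {
    # only NA cells are overwritten; existing values are kept
    if (is.na(df.1[row, col])) {
      df.1[row, col] <- df.2[m, col]
    }
  }
}
df.1
```

Rows whose month does not occur in df.2 are skipped and keep their NA values.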
Answer 1 (score: 1)
Assuming that whole rows are missing and you want to fill them in, you can do this with which and match.
# find the location of the missing rows in df
missRows <- which(!complete.cases(df.1))
# fill in missing rows with rows in df.2 with matching months
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),]
Note that the missing rows are identified with !complete.cases. Also, the nomatch = 0 argument is used to ignore cases where no match is found.
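One caveat with nomatch = 0: if a month in df.1 has no counterpart in df.2, the right-hand side ends up with fewer rows than missRows, so the two sides no longer line up row for row. The following is a sketch, on made-up data, of a variant that only touches the incomplete rows that do have a match. The names idx and found are illustrative.

```r
# Illustrative data (not the original): month 4 does not exist in df.2
df.1 <- data.frame(month = c(1, 2, 3, 4), a = c(0, NA, 0, NA), b = c(0, NA, 0, NA))
df.2 <- data.frame(month = 1:3, a = c(10, 20, 30), b = c(100, 200, 300))

# rows of df.1 that contain at least one NA
missRows <- which(!complete.cases(df.1))
# position of each of those months in df.2 (NA where df.2 has no such month)
idx <- match(df.1$month[missRows], df.2$month)
found <- !is.na(idx)
# overwrite whole rows, but only where a matching month was found
df.1[missRows[found], ] <- df.2[idx[found], ]
df.1
```

As in the answer, this overwrites entire rows, so it is only appropriate when incomplete rows carry no values worth keeping.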
Answer 2 (score: 0)
Maybe not the best way, but something like this works:
df1 <- data.frame(month = 1:12,
a = c(rep(1, 10), NA, NA),
b = c(rep(2, 11), NA))
df2 <- data.frame(month = 1:12,
a = rnorm(12),
b = rnorm(12))
# first, merge both data frames by the key, in this case the month
new_df <- merge(df1, df2, by = "month")
# then use a vectorized operation with the ifelse function
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y)
# then you need to drop the temporary columns or take a subset of the
# newly generated imputed columns
new_df
Maybe create a function for the ifelse step if you need to impute many columns, like this:
impute <- function(df, col1, col2) {
  # impute NA values in col1 with the values of col2, creating a new column
  new_name <- paste("new", col1, sep = "_")
  df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
  df
}
impute(new_df, "a.x", "a.y")
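To apply the same idea to several columns at once, one option is to loop over the shared column names after the merge. This is only a sketch with made-up data; it relies on merge's default suffixes .x and .y, and the names merged and result are illustrative.

```r
# Illustrative data (not the original)
df1 <- data.frame(month = 1:4, a = c(1, NA, 1, NA), b = c(2, 2, NA, NA))
df2 <- data.frame(month = 1:4, a = c(10, 20, 30, 40), b = c(50, 60, 70, 80))

# shared non-key columns become a.x / a.y and b.x / b.y
merged <- merge(df1, df2, by = "month")

for (col in c("a", "b")) {
  x <- merged[[paste0(col, ".x")]]  # values from df1
  y <- merged[[paste0(col, ".y")]]  # values from df2
  # keep df1's value unless it is NA
  merged[[col]] <- ifelse(is.na(x), y, x)
}

# keep only the key and the imputed columns
result <- merged[, c("month", "a", "b")]
result
```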
Answer 3 (score: 0)
Given that your data frame is larger, I would try to avoid merging the tables. You can get the job done with ifelse.
month <- c(1:12, 1:12)
a <- c(rep(0,10), NA, NA, rep(0,11), NA)
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA)
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA)
df.1 <- data.frame(month,a,b,c)
df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12))
df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a)
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b)
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c)
> df.1
month a b c
1 1 0 0.0000 0.0000
2 2 0 0.0000 0.0015
3 3 0 0.0000 0.0000
4 4 0 0.0000 0.0000
5 5 0 0.0000 0.0000
6 6 0 0.0000 0.0000
7 7 0 0.0000 0.0000
8 8 0 0.0000 0.0000
9 9 0 0.0000 0.0000
10 10 0 0.0000 0.0000
11 11 1 2.0000 3.0000
12 12 1 2.0000 3.0000
13 1 0 0.0000 0.0000
14 2 0 0.0015 0.0069
15 3 0 0.0000 0.0036
16 4 0 0.0000 0.0000
17 5 0 0.0000 0.0000
18 6 0 0.0000 0.0000
19 7 0 0.0000 0.0000
20 8 0 0.0000 0.0000
21 9 0 0.0000 0.0000
22 10 0 0.0000 0.0000
23 11 0 0.0000 0.0135
24 12 1 2.0000 3.0000
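With many columns, the three ifelse lines above can be folded into a loop that computes the match() lookup once. A minimal sketch on made-up data follows; the names idx and col are illustrative.

```r
# Illustrative data (not the original)
df.1 <- data.frame(month = c(1, 2, 3, 1, 2, 3),
                   a = c(0, NA, 0, 0, 0, NA),
                   b = c(0, NA, 0.5, 0, 0, NA))
df.2 <- data.frame(month = 1:3, a = c(1, 1, 1), b = c(2, 2, 2))

# for every row of df.1, the row of df.2 with the same month
idx <- match(df.1$month, df.2$month)

for (col in c("a", "b")) {
  # take df.2's value only where df.1 has NA
  df.1[[col]] <- ifelse(is.na(df.1[[col]]), df.2[[col]][idx], df.1[[col]])
}
df.1
```

Replacing c("a", "b") with names(df.1)[-1] would cover every column except month, assuming df.2 has columns of the same names.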