R: NA replacement between data frames based on an index

Time: 2017-08-23 15:02:41

Tags: r for-loop if-statement na substitution

I have this data, df.1:

month  a           b           c
    1  0 0.000000000 0.000000000
    2  0 0.000000000 0.001503194
    3  0 0.000000000 0.000000000
    4  0 0.000000000 0.000000000
    5  0 0.000000000 0.000000000
    6  0 0.000000000 0.000000000
    7  0 0.000000000 0.000000000
    8  0 0.000000000 0.000000000
    9  0 0.000000000 0.000000000
   10  0 0.000000000 0.000000000
   11 NA          NA          NA
   12 NA          NA          NA
    1  0 0.000000000 0.000000000
    2  0 0.001537279 0.006917756
    3  0 0.000000000 0.003669725
    4  0 0.000000000 0.000000000
    5  0 0.000000000 0.000000000
    6  0 0.000000000 0.000000000
    7  0 0.000000000 0.000000000
    8  0 0.000000000 0.000000000
    9  0 0.000000000 0.000000000
   10  0 0.000000000 0.000000000
   11  0 0.000000000 0.013513514
   12 NA          NA          NA

and this data, df.2:

month          a           b           c
    1 0.03842077 0.002266291 0.000000000
    2 0.01359501 0.001027937 0.000000000
    3 0.08631519 0.008732519 0.001376147
    4 0.26564710 0.083635347 0.019053692
    5 0.34839088 0.152203121 0.021010075
    6 0.31767367 0.152029019 0.029397773
    7 0.31507761 0.110973916 0.023445471
    8 0.29773872 0.096458381 0.026745770
    9 0.31226976 0.109342562 0.023996392
   10 0.23841220 0.081582743 0.021674228
   11 0.04379016 0.003519300 0.000000000
   12 0.02244389 0.002493766 0.000000000

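I would like to replace the NA values (and only the NAs) in df.1[,2:4] with the values from df.2[,2:4] whenever the index in column 1 (month) is the same. I tried this code:

res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3))
for (n in 1:12){
  res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n]))
}

But the result is a large new matrix in which every NA value is replaced with all of the values in df.1.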

What can I do? (My actual data frame is much larger.)

4 answers:

Answer 0: (score: 1)

The first 12 rows of the data:

df.1 <- data.frame(
  month = 1:12, 
  a = c(rep(0, 10), NA, NA), 
  b = c(rep(0, 10), NA, NA), 
  c = c(0, 0.001503194, rep(0, 8), NA, NA)
)

df.2 <- data.frame(
  month = 1:12,
  a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367, 
        0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389), 
  b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121, 
        0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743, 
        0.0035193, 0.002493766 ), 
  c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471,
        0.02674577, 0.023996392, 0.021674228, 0, 0)
)

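Solution

This solution also handles rows in which only some of the columns are NA. On large data it may take a while to finish. Note that it compares row i of df.1 with row i of df.2, so it assumes both data frames list the months in the same row order, as in the 12-row example above.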

for (row in 1:nrow(df.1)) {
  for (col in names(df.1)[-1]) {
    if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) {
      df.1[row, col] <- df.2[row, col]
    }
  }
}
df.1

   month          a           b           c
1      1 0.00000000 0.000000000 0.000000000
2      2 0.00000000 0.000000000 0.001503194
3      3 0.00000000 0.000000000 0.000000000
4      4 0.00000000 0.000000000 0.000000000
5      5 0.00000000 0.000000000 0.000000000
6      6 0.00000000 0.000000000 0.000000000
7      7 0.00000000 0.000000000 0.000000000
8      8 0.00000000 0.000000000 0.000000000
9      9 0.00000000 0.000000000 0.000000000
10    10 0.00000000 0.000000000 0.000000000
11    11 0.04379016 0.003519300 0.000000000
12    12 0.02244389 0.002493766 0.000000000

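Explanation

Using a double loop, we check every element in columns a to c. If the element is not NA, we move on to the next one. Otherwise, we check whether the month in the same row of df.2 is the same, and if that is TRUE, we replace the element with the corresponding value from df.2.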
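The loop above compares row i of df.1 with row i of df.2, which fits the 12-row example. As a minimal sketch of one way to adapt it to a df.1 in which the months repeat (as in the question's full data), the matching row of df.2 can be looked up with match() before filling. The small data frames below are made up for illustration; they are not the question's data.

```r
# Hypothetical toy data: months repeat in df.1 but appear once in df.2
df.1 <- data.frame(
  month = c(11, 12, 11, 12),
  a = c(NA, NA, 0, NA),
  b = c(NA, 0.5, 0, NA)
)
df.2 <- data.frame(
  month = c(11, 12),
  a = c(0.04379016, 0.02244389),
  b = c(0.003519300, 0.002493766)
)

for (row in seq_len(nrow(df.1))) {
  # Row of df.2 holding the same month; NA when the month is absent from df.2
  src <- match(df.1[row, "month"], df.2$month)
  if (is.na(src)) next
  for (col in names(df.1)[-1]) {
    # Only NA cells are overwritten; existing values are left alone
    if (is.na(df.1[row, col])) {
      df.1[row, col] <- df.2[src, col]
    }
  }
}
df.1
```

Rows whose month does not occur in df.2 are skipped, so their NA values stay in place.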

Answer 1: (score: 1)

Assuming it is entire rows whose missing values you want to fill in, you can do this with which and match.

# find the location of the missing rows in df
missRows <- which(!complete.cases(df.1))
# fill in missing rows with rows in df.2 with matching months
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),]

Note that the missing rows are identified with !complete.cases. In addition, the nomatch = 0 argument is used to ignore instances where no match is found.
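The row-wise replacement above overwrites a whole row as soon as one of its cells is NA. If only the NA cells themselves should be filled, one possible variation (a sketch, not part of the original answer) indexes with the logical matrix returned by is.na(). The toy data and the names cols, lookup and holes are assumptions made for this example.

```r
# Hypothetical toy data with partially missing rows
df.1 <- data.frame(month = c(1, 2, 3, 1),
                   a = c(0, NA, NA, NA),
                   b = c(NA, 5, NA, 7))
df.2 <- data.frame(month = c(1, 2, 3),
                   a = c(10, 20, 30),
                   b = c(100, 200, 300))

cols <- c("a", "b")

# One row of df.2 per row of df.1, lined up by month
lookup <- df.2[match(df.1$month, df.2$month), cols]

# Logical matrix: TRUE wherever df.1 has an NA in the value columns
holes <- is.na(df.1[cols])

# Fill only the NA cells; both sides are read in column-major order
df.1[cols][holes] <- as.matrix(lookup)[holes]
df.1
```

Non-missing values such as b = 5 in the second row are kept, which a whole-row replacement would overwrite.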

Answer 2: (score: 0)

Maybe not the best approach, but something like this works!

df1 <- data.frame(month = 1:12,
                  a = c(rep(1, 10), NA, NA),
                  b = c(rep(2, 11), NA))

df2 <- data.frame(month = 1:12,
                  a = rnorm(12),
                  b = rnorm(12))

# first, merge both data frame by the key in this case the month
new_df <- merge(df1, df2, by = "month")

# then use a vectorize operation with ifelse function
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y)

# then you need to drop the temporal columns or make a subset of the
# new imputed columns generated
new_df

If you need to impute many columns, you could wrap the ifelse step in a function, like this:

impute <- function(df, col1, col2) {
  # impute the NAs in col1 with the values of col2, storing the result in a new column
  new_name <- paste("new", col1, sep = "_")
  df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
  df
}

impute(new_df, "a.x", "a.y")
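To cover several columns after the merge, the same ifelse step can be run in a loop over the base column names, after which the suffixed helper columns can be dropped. This is only a sketch under assumed toy data; it relies on merge() adding its default suffixes .x and .y to the columns the two data frames share.

```r
# Hypothetical toy data
df1 <- data.frame(month = 1:4, a = c(1, NA, 1, NA), b = c(2, 2, NA, NA))
df2 <- data.frame(month = 1:4, a = c(10, 20, 30, 40), b = c(50, 60, 70, 80))

# Shared non-key columns get the suffixes .x (df1) and .y (df2)
new_df <- merge(df1, df2, by = "month")

for (col in c("a", "b")) {
  x <- new_df[[paste0(col, ".x")]]  # values from df1
  y <- new_df[[paste0(col, ".y")]]  # values from df2
  # Keep df1's value unless it is NA
  new_df[[col]] <- ifelse(is.na(x), y, x)
}

# Keep only the key and the imputed columns
new_df <- new_df[c("month", "a", "b")]
new_df
```

Keep in mind that merge() sorts the result by the key by default, so when months repeat in the first data frame the row order of the result can differ from the original.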

Answer 3: (score: 0)

Given that you have a larger data frame, I would try to avoid merging the tables. You can get the job done with ifelse.

month <- c(1:12, 1:12)
a <- c(rep(0,10), NA, NA, rep(0,11), NA)
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA)
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA)
df.1 <- data.frame(month,a,b,c)

df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12))

df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a)
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b)
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c)

> df.1
   month a      b      c
1      1 0 0.0000 0.0000
2      2 0 0.0000 0.0015
3      3 0 0.0000 0.0000
4      4 0 0.0000 0.0000
5      5 0 0.0000 0.0000
6      6 0 0.0000 0.0000
7      7 0 0.0000 0.0000
8      8 0 0.0000 0.0000
9      9 0 0.0000 0.0000
10    10 0 0.0000 0.0000
11    11 1 2.0000 3.0000
12    12 1 2.0000 3.0000
13     1 0 0.0000 0.0000
14     2 0 0.0015 0.0069
15     3 0 0.0000 0.0036
16     4 0 0.0000 0.0000
17     5 0 0.0000 0.0000
18     6 0 0.0000 0.0000
19     7 0 0.0000 0.0000
20     8 0 0.0000 0.0000
21     9 0 0.0000 0.0000
22    10 0 0.0000 0.0000
23    11 0 0.0000 0.0135
24    12 1 2.0000 3.0000
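With many value columns, the repeated ifelse lines can be folded into a loop that computes the match() lookup only once. The following is a sketch on made-up data, assuming the value columns carry the same names in both data frames; idx is a name introduced here for illustration.

```r
# Hypothetical toy data: months repeat in df.1
df.1 <- data.frame(month = c(1:3, 1:3),
                   a = c(0, NA, 0, NA, 0, NA),
                   b = c(NA, 0, 0, 0, NA, NA))
df.2 <- data.frame(month = 1:3,
                   a = c(11, 12, 13),
                   b = c(21, 22, 23))

# For each row of df.1, the position of its month in df.2
idx <- match(df.1$month, df.2$month)

for (col in c("a", "b")) {
  # Take df.2's value for the same month wherever df.1 is NA
  df.1[[col]] <- ifelse(is.na(df.1[[col]]), df.2[[col]][idx], df.1[[col]])
}
df.1
```

Because the lookup goes through idx rather than a merge, the row order of df.1 is preserved.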