Question

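I have a data frame containing lab results for individual subjects. Some subjects have duplicate records, but the duplicates are missing certain data points in one record that are present in another.

I am trying to write a function that will "fill in" a row's NA data points from any duplicates that may exist for that subject. Here is what I have tried: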

# example data with duplicate IDs, some with missing values

ir<-head(iris)
ir$unique_flower_ID<-1:6
ir<-rbind(ir, ir[c(1,3,5),])
ir[7:9, c(1,3)]<-NA
ir[c(1,3,5), c(2,4)]<-NA
ir<-ir[order(ir$unique_flower_ID),]

# function to run on a given dataframe (df) to 
# replace missing values in certain variables (vars) from duplicates
# as identified by a unique ID
replaceNAs_dupl <- function(df, ID, vars) {
  #identify duplicate IDs and subset the dataframe
  df_dupl<-data.frame(table(df[, ID]))
  df_dupl<-df[df[, ID] %in% df_dupl$Var1[which(df_dupl$Freq > 1)],]

  # loop through specified columns
  for(i in vars) {
    #create a mini-dataframe of ID and value for each column
    df_dupl_uni<-unique(df_dupl[which(!is.na(df_dupl[,i])), c(ID, i)])
    # replace missing data with data from duplicate record
    df[which(df[, ID] %in% df_dupl_uni[, ID]), i] <- df_dupl_uni[match(df[which(df[, ID] %in% df_dupl_uni[, ID]), ID], df_dupl_uni[, ID]), i]

    return(df)
    }      
}

# define the columns to run the function on by name
col_names<-colnames(ir[,1:4])

# pass ir to the function
ir2<-replaceNAs_dupl(ir, "unique_flower_ID", col_names)

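The output works, but only for the first column; the loop does not loop at all.

Can someone explain what I am doing wrong?
Is there a better way altogether to do what I am trying to do?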

Answer 1

As @jdobres said, the original problem is that return is called inside the loop, before the loop has finished.
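To make that concrete, here is a minimal sketch of the question's function with the return moved out of the loop. The logic is otherwise the asker's own; the intermediate names dupl_ids and rows are introduced here only for readability and are not in the original.

```r
# Rebuild the example data from the question
ir <- head(iris)
ir$unique_flower_ID <- 1:6
ir <- rbind(ir, ir[c(1, 3, 5), ])
ir[7:9, c(1, 3)] <- NA
ir[c(1, 3, 5), c(2, 4)] <- NA
ir <- ir[order(ir$unique_flower_ID), ]

replaceNAs_dupl <- function(df, ID, vars) {
  # identify IDs that occur more than once and subset the data frame
  counts <- table(df[, ID])
  dupl_ids <- names(counts)[counts > 1]   # helper name, not in the original
  df_dupl <- df[df[, ID] %in% dupl_ids, ]

  # loop through the specified columns
  for (i in vars) {
    # mini data frame of ID and non-missing value for this column
    df_dupl_uni <- unique(df_dupl[!is.na(df_dupl[, i]), c(ID, i)])
    # rows whose ID has a known value somewhere among its duplicates
    rows <- which(df[, ID] %in% df_dupl_uni[, ID])
    df[rows, i] <- df_dupl_uni[match(df[rows, ID], df_dupl_uni[, ID]), i]
  }

  # return only after every column has been processed
  df
}

ir2 <- replaceNAs_dupl(ir, "unique_flower_ID", colnames(ir)[1:4])
```

With the return inside the for loop, the function exits at the end of the first iteration, which is why only the first column was filled.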

I offer the following code as an alternative implementation:

library(dplyr)
ir %>%
  group_by(unique_flower_ID) %>%
  mutate_at(vars(Sepal.Length:Petal.Width), ~ if_else(is.na(.), na.omit(.)[1], .)) %>%
  ungroup()
# # A tibble: 9 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>              <int>
# 1          5.1         3.5          1.4         0.2 setosa                 1
# 2          5.1         3.5          1.4         0.2 setosa                 1
# 3          4.9         3            1.4         0.2 setosa                 2
# 4          4.7         3.2          1.3         0.2 setosa                 3
# 5          4.7         3.2          1.3         0.2 setosa                 3
# 6          4.6         3.1          1.5         0.2 setosa                 4
# 7          5           3.6          1.4         0.2 setosa                 5
# 8          5           3.6          1.4         0.2 setosa                 5
# 9          5.4         3.9          1.7         0.4 setosa                 6

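How it works:

Grouping by the ID field means that the code that follows runs once per unique ID; in other words, the first time the function inside mutate_at is called, it sees only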

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
1           5.1          NA          1.4          NA  setosa                1
11           NA         3.5           NA         0.2  setosa                1

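mutate_at applies the same function to one or more columns, in this case every column from Sepal.Length to Petal.Width (inclusive);

The function being called uses rlang's "tilde notation", in which the dot . is replaced by the vector of data within each column, so that it effectively runs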

if_else(is.na(Sepal.Length), na.omit(Sepal.Length)[1], Sepal.Length)
if_else(is.na(Sepal.Width), na.omit(Sepal.Width)[1], Sepal.Width)
if_else(is.na(Petal.Length), na.omit(Petal.Length)[1], Petal.Length)
if_else(is.na(Petal.Width), na.omit(Petal.Width)[1], Petal.Width)

(This could just as easily have been mutate_at(..., function(a) if_else(is.na(a), na.omit(a)[1], a)), but I like the more compact ~ notation.)
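As a side note, in dplyr 1.0.0 and later mutate_at is superseded by across() inside mutate(). The following is a sketch of the same grouped fill in that form, assuming a dplyr version that provides across(); the result name filled is only for illustration, and .x is the formula-lambda pronoun that plays the same role as the dot.

```r
library(dplyr)

# Rebuild the example data from the question
ir <- head(iris)
ir$unique_flower_ID <- 1:6
ir <- rbind(ir, ir[c(1, 3, 5), ])
ir[7:9, c(1, 3)] <- NA
ir[c(1, 3, 5), c(2, 4)] <- NA
ir <- ir[order(ir$unique_flower_ID), ]

filled <- ir %>%
  group_by(unique_flower_ID) %>%
  # apply the same fill function to each of the four measurement columns
  mutate(across(Sepal.Length:Petal.Width,
                ~ if_else(is.na(.x), na.omit(.x)[1], .x))) %>%
  ungroup()
```

The grouping and the per-column function are unchanged; only the way the columns are selected differs.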

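For each value in the vector, if it is not NA, it is used unchanged; if it is NA, it is replaced with the first non-NA value in that group ("first" meaning first in row order within the group, so if a group holds several different values, you must control which one takes priority through the order of the rows);
Using na.omit(.)[1] guards against a column having no data available: if na.omit(.) returns nothing (a vector of length 0, as with na.omit(NA)), then [1] forces it to return something, which in our case is (another) NA, so we preserve a full-length vector. For example:
```
ir$Sepal.Length[1:2] <- NA
ir %>%
  group_by(unique_flower_ID) %>%
  mutate_at(vars(Sepal.Length:Petal.Width), ~ if_else(is.na(.), na.omit(.)[1], .)) %>%
  ungroup()
# # A tibble: 9 x 6
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species unique_flower_ID
#          <dbl>       <dbl>        <dbl>       <dbl> <fct>              <int>
# 1         NA           3.5          1.4         0.2 setosa                 1
# 2         NA           3.5          1.4         0.2 setosa                 1
# 3          4.9         3            1.4         0.2 setosa                 2
# 4          4.7         3.2          1.3         0.2 setosa                 3
# 5          4.7         3.2          1.3         0.2 setosa                 3
# 6          4.6         3.1          1.5         0.2 setosa                 4
# 7          5           3.6          1.4         0.2 setosa                 5
# 8          5           3.6          1.4         0.2 setosa                 5
# 9          5.4         3.9          1.7         0.4 setosa                 6
```

(PS: since you are new to R, I should clarify that this use of the tilde is specific to the rlang package, and so to the tidyverse; it is not necessarily available in other packages or functions. Unless it is explicitly documented, you should use a more general anonymous function (such as the function(a) form shown earlier) or a named function.)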
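The na.omit(.)[1] guard described in this answer can be checked in isolation with base R. This is a small sketch; the variable names empty, x and filled are illustrative only, and base ifelse stands in for dplyr's if_else so that the snippet needs no packages.

```r
# na.omit() on an all-NA vector leaves nothing behind
empty <- na.omit(NA)
length(empty)   # 0

# indexing past the end of a vector yields NA, so [1] always gives one value
empty[1]        # NA

# the fill therefore still returns a vector of the original length
x <- c(NA_real_, NA_real_)
filled <- ifelse(is.na(x), na.omit(x)[1], x)
length(filled)  # 2
```

Without the [1], the replacement would have length 0 and could not be recycled against the column.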

Answer 2

Here is a simple (if somewhat naive) solution that merges the records.

library(dplyr)
ir2 <- ir %>% 
  group_by(unique_flower_ID) %>% 
  summarise_if(is.numeric, mean, na.rm=TRUE) %>% 
  ungroup()

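Limitations:

This merges the records, which means there are no longer any duplicates; that may not be what you want.
If two duplicate records do not match, their mean is taken. mean can be replaced with another summary function, but that function will likely throw some kind of error if a given column holds two records with the same ID but different values.
If every record with a given ID has NA in a column, it returns NaN.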
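The last two limitations can be seen on plain vectors, without any grouping. This is a sketch only; first_non_na is a hypothetical helper, not part of the answer, shown as one possible summary function that returns NA rather than NaN when a group has no data.

```r
# A group whose values are all missing yields NaN, not NA
all_missing <- mean(c(NA_real_, NA_real_), na.rm = TRUE)
is.nan(all_missing)   # TRUE

# Conflicting duplicates are silently averaged
averaged <- mean(c(5.1, 4.9), na.rm = TRUE)

# Hypothetical alternative: keep the first non-missing value instead
first_non_na <- function(x) na.omit(x)[1]
kept <- first_non_na(c(NA, 3.5))                    # 3.5
none <- first_non_na(c(NA_real_, NA_real_))         # NA
```

Note that first_non_na also hides conflicts between duplicates: it simply keeps whichever value comes first in row order.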
