Question

My data looks like this:

df <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
            type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
            V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
            V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA))

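I need to impute NAs as 0, but only in a very specific situation. Within each pop, the data columns (V1, V2, etc.) must end up either all NA or all numeric across that pop's types. In this example, the Spades pop is missing data in V1 and V2 for the Spades-Three row, while Spades-Ace and Spades-Two have data. Therefore V1 and V2 for Spades-Three need to change from NA to 0. The same applies to the Clubs pop. The Hearts pop has no data at all, so its NAs stay as they are.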

The resulting dataset should look like this:

df2 <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
            type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
            V1 = c(4, 3, 0, 7, 0, 0, 5, 12, NA, NA),
            V2 = c(16, 23, 0, 15, 0, 0, 8, 19, NA, NA))

I can perform this imputation with the following code:

library(dplyr)  # needed for filter()

ID <- unique(df$pop)

for (i in 1:length(ID)) {
   dftemp <- filter(df, pop == paste(ID[i]))
   # Number of unique categories for a pop-type combination
   num_type <- length(dftemp$type)
   # Number of NA's in that combination for V1
   num_na <- sum(is.na(dftemp$V1) == TRUE)
   print(num_type)
   print(num_na)
   if (num_na < num_type && num_na > 0) {
     # print(paste(ID[i]))
     df$V1[with(df, pop == paste(ID[i]) & is.na(V1))] <- 0
     df$V2[with(df, pop == paste(ID[i]) & is.na(V2))] <- 0
   }
}

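My problem is scaling this up. I need to do this for many more columns, so I want to put the column names in a vector and loop over it. But for some reason, inside the if block above, changing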

df$V1[with(df, pop == paste(ID[i]) & is.na(V1))] <- 0

to

df[newlist[k]][with(df, pop == paste(ID[i]) & is.na(newlist[k]))] <- 0

(where newlist <- c("V1", "V2", "V3", "V4"), etc.) makes the pop == paste(ID[i]) condition stop working. If I hard-code pop == "Spades" it works, but obviously that is less efficient than the old approach.

The end goal is a function that I can pass the data frame and a list of columns to, but I am stuck on this issue.

My current attempt at writing the function looks like this:

imputezero <- function(df, columnlist) {
  for (i in 1:length(ID)) {
    for (x in 1:length(columnlist)) {
      dftemp <- filter(df, pop == paste(ID[i]))
      num_type <- length(dftemp$type)
      num_na <- sum(is.na(dftemp[collist[x]]) == TRUE)
      if (num_na < num_type && num_na > 0) {
        df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0
        return(df)
      }
    }
  }
}

list_status <- c("V1", "V2")
test_df <- imputezero(df, list_status)

So how can I get df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0 to work?

If my general approach is completely wrong, or there is a way to cut out all the noise, I would welcome any feedback on that too.

Answer 1

You can use mutate_at from dplyr to achieve this; it scales to any number of columns.

If I understand correctly, you can do this:

library(dplyr)

df %>%
  group_by(pop) %>%
  # Within each pop, set NA to 0 unless every value in the column is NA
  mutate_at(.vars = vars(-type),
            .funs = ~ ifelse(is.na(.) & sum(is.na(.)) != n(), 0, .))
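
Since dplyr 1.0.0, mutate_at is superseded by across(). As a rough sketch of the same idea in that style, assuming the columns to impute are named in a character vector (the names `cols` and `imputed` are only illustrative, not from the question):

```r
library(dplyr)

df <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
                 type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
                 V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
                 V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA))

# Hypothetical name: the columns that should be imputed
cols <- c("V1", "V2")

imputed <- df %>%
  group_by(pop) %>%
  # Replace NA with 0 only when the pop has at least one non-NA value in that column
  mutate(across(all_of(cols), ~ ifelse(is.na(.x) & !all(is.na(.x)), 0, .x))) %>%
  ungroup()
```

Passing the column names through all_of() is what lets the same call scale to an arbitrary list of columns.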

Answer 2

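I changed the if in the function so that the loop skips to the next iteration when num_na equals num_type or num_na is zero. Otherwise it runs the df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0 line. I also fixed the collist typo and moved return(df) to the end of the function, so it no longer returns after the first replacement. This seems to work.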

library(dplyr)  # for filter()

imputezero <- function(df, columnlist) {
  # Compute the pops inside the function instead of relying on a global ID
  ID <- unique(df$pop)
  for (i in seq_along(ID)) {
    for (x in seq_along(columnlist)) {
      dftemp <- filter(df, pop == paste(ID[i]))
      num_type <- length(dftemp$type)
      num_na <- sum(is.na(dftemp[columnlist[x]]))
      # Skip pops that are entirely NA or have no NA in this column
      if (num_na == num_type || num_na == 0) {
        next
      }
      df[columnlist[x]][with(df, pop == paste(ID[i]) & is.na(df[columnlist[x]]))] <- 0
    }
  }
  return(df)
}
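
As for why the line in the question stopped working: is.na(newlist[k]) tests the character string "V1" itself, which is never NA, so the whole condition is FALSE for every row. In addition, df[newlist[k]] returns a one-column data frame, whereas df[[newlist[k]]] returns the column as a vector. A minimal base R sketch along those lines, without dplyr and without the inner bookkeeping, might look like this (the function name `impute_zero` and the `group` argument are illustrative assumptions, not part of the original code):

```r
df <- data.frame(pop = c("Spades", "Spades", "Spades", "Clubs", "Clubs", "Clubs", "Diamonds", "Diamonds", "Hearts", "Hearts"),
                 type = c("Ace", "Two", "Three", "Ace", "Two", "Three", "Ace", "Two", "King", "Queen"),
                 V1 = c(4, 3, NA, 7, NA, NA, 5, 12, NA, NA),
                 V2 = c(16, 23, NA, 15, NA, NA, 8, 19, NA, NA))

# Hypothetical helper: impute 0 for NAs in groups that have some data
impute_zero <- function(df, cols, group = "pop") {
  for (col in cols) {
    x <- df[[col]]  # [[ ]] extracts the column as a plain vector
    # For every row: does its group contain at least one non-NA value?
    has_data <- ave(!is.na(x), df[[group]], FUN = any)
    # Only NAs in partially filled groups become 0; all-NA groups are untouched
    df[[col]][is.na(x) & has_data] <- 0
  }
  df
}

out <- impute_zero(df, c("V1", "V2"))
```

Here ave() broadcasts the per-group result of any() back to every row, which replaces the explicit loop over pops.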

Conditional statement disagrees with list index in loop
