Add missing values based on the values of a column in R

Time: 2017-12-31 14:44:55

Tags: r for-loop dataframe rbind

Here is my sample dataset:

    vector1 <-
      data.frame(
        "name" = "a",
        "age" = 10,
        "fruit" = c("orange", "cherry", "apple"),
        "count" = c(1, 1, 1),
        "tag" = c(1, 1, 2)
      )
    vector2 <-
      data.frame(
        "name" = "b",
        "age" = 33,
        "fruit" = c("apple", "mango"),
        "count" = c(1, 1),
        "tag" = c(2, 2)
      )
    vector3 <-
      data.frame(
        "name" = "c",
        "age" = 58,
        "fruit" = c("cherry", "apple"),
        "count" = c(1, 1),
        "tag" = c(1, 1)
      )

    list <- list(vector1, vector2, vector3)
    print(list)

Here is my attempt:

default <- c("cherry",
       "orange",
       "apple",
       "mango")

for (num in 1:length(list)) {
  #print(list[[num]])

  list[[num]] <- rbind(
    list[[num]],
    data.frame(
      "name" = list[[num]]$name,
      "age" = list[[num]]$age,
      "fruit" = setdiff(default, list[[num]]$fruit),#add missed value
      "count" = 0,
      "tag" = 1 #not found solutions
    )
  )

  print(paste0("--------------", num, "--------"))
  print(list)
}
#print(list)

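I am trying to find which fruits are missing from each data frame, where the set of fruits is checked per tag value. For example, the first data frame has tags 1 and 2. If tag 1 lacks some of the default fruits, such as apple and banana, the missing default fruits should be added to the data frame with a count of 0. The expected format is: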

[[1]]
  name age  fruit count tag
1    a  10 orange     1   1
2    a  10 cherry     1   1
3    a  10  apple     1   2
4    a  10  mango     0   1
5    a  10  apple     0   1
6    a  10  mango     0   2
7    a  10 orange     0   2
8    a  10 cherry     0   2

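When I stepped through the loop, I also noticed that the first iteration adds mango three times, and I could not work out why it does not add the missing value just once. The full output is: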

[[1]]
  name age  fruit count tag
1    a  10 orange     1   1
2    a  10 cherry     1   1
3    a  10  apple     1   2
4    a  10  mango     0   1
5    a  10  mango     0   1
6    a  10  mango     0   1

[[2]]
  name age  fruit count tag
1    b  33  apple     1   2
2    b  33  mango     1   2
3    b  33 cherry     0   1
4    b  33 orange     0   1

[[3]]
  name age  fruit count tag
1    c  58 cherry     1   1
2    c  58  apple     1   1
3    c  58 orange     0   1
4    c  58  mango     0   1

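Can anyone help me and suggest a simple approach or an alternative? Should I use the sqldf function to add the 0 values? Would that be a simple way to solve my problem?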

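3 answers:

Answer 0 (score: 2)

Consider a base R approach (lapply, expand.grid, transform, rbind, aggregate) that appends every possible fruit/tag combination with a count of 0 and then keeps the maximum count per combination.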

new_list <- lapply(list, function(df) {
  fruit_tag_df <- transform(expand.grid(fruit=c("apple", "cherry", "mango", "orange"),
                                        tag=c(1,2)),
                            name = df$name[1],
                            age = df$age[1],
                            count = 0)

  aggregate(.~name + age + fruit + tag, rbind(df, fruit_tag_df), FUN=max)
})

Output

new_list

# [[1]]
#   name age  fruit tag count
# 1    a  10  apple   1     0
# 2    a  10 cherry   1     1
# 3    a  10 orange   1     1
# 4    a  10  mango   1     0
# 5    a  10  apple   2     1
# 6    a  10 cherry   2     0
# 7    a  10 orange   2     0
# 8    a  10  mango   2     0

# [[2]]
#   name age  fruit tag count
# 1    b  33  apple   1     0
# 2    b  33  mango   1     0
# 3    b  33 cherry   1     0
# 4    b  33 orange   1     0
# 5    b  33  apple   2     1
# 6    b  33  mango   2     1
# 7    b  33 cherry   2     0
# 8    b  33 orange   2     0

# [[3]]
#   name age  fruit tag count
# 1    c  58  apple   1     1
# 2    c  58 cherry   1     1
# 3    c  58  mango   1     0
# 4    c  58 orange   1     0
# 5    c  58  apple   2     0
# 6    c  58 cherry   2     0
# 7    c  58  mango   2     0
# 8    c  58 orange   2     0
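
On the side question of why the loop in the question adds mango three times: this comes from vector recycling in data.frame(). The existing data frame has three rows, so list[[num]]$name and list[[num]]$age are vectors of length 3, and the single missing fruit is recycled to match them. The sketch below reproduces the effect on a cut-down copy of the first data frame; the names df, bad and good are illustrative only.

```r
# Cut-down copy of the first data frame from the question
df <- data.frame(
  name = "a",
  age = 10,
  fruit = c("orange", "cherry", "apple"),
  stringsAsFactors = FALSE
)
default <- c("cherry", "orange", "apple", "mango")

# Only one fruit is missing
missing_fruit <- setdiff(default, df$fruit)

# df$name and df$age have length 3, so the single missing fruit is
# recycled three times: this is the repeated "mango" row
bad <- data.frame(name = df$name, age = df$age,
                  fruit = missing_fruit, count = 0)

# Taking one value for name and age gives one row per missing fruit
good <- data.frame(name = df$name[1], age = df$age[1],
                   fruit = missing_fruit, count = 0)

nrow(bad)
nrow(good)
```

Using a single value for name and age removes the duplicates, but every added row would still get the hard-coded tag, so this alone does not produce the per-tag combinations the question asks for.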

Answer 1 (score: 2)

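The OP has asked to complete each data.frame in list so that all combinations of the default fruits and the tags 1:2 appear in the result, with count set to 0 for the added rows. In the end, each data.frame should contain at least 4 x 2 = 8 rows.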
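
To make the 4 x 2 = 8 figure concrete, here is a minimal sketch that builds only the grid of combinations with CJ() from data.table. It assumes the data.table package is installed; the name combos is illustrative.

```r
library(data.table)

default <- c("cherry", "orange", "apple", "mango")

# CJ() returns the cross join (all combinations) of its arguments,
# sorted by the columns in the order given
combos <- CJ(fruit = default, tag = 1:2)

nrow(combos)
```

Each data.frame then only needs to be joined against this grid, and any combination without a match receives a count of 0.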

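I want to propose two different approaches:

  1. Use lapply() and the CJ() (cross join) function from data.table to return a list.
  2. Combine the separate data.frames in list into one large data.table with rbindlist() and apply the required transformations to the whole data.table.

Using lapply() and CJ()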

    library(data.table)
    lapply(list, function(x) setDT(x)[
      CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE), 
      on = .(name, age, fruit, tag)][
        is.na(count), count := 0][order(-count, tag)]
    )
    
    [[1]]
       name age  fruit count tag
    1:    a  10 cherry     1   1
    2:    a  10 orange     1   1
    3:    a  10  apple     1   2
    4:    a  10  apple     0   1
    5:    a  10  mango     0   1
    6:    a  10 cherry     0   2
    7:    a  10  mango     0   2
    8:    a  10 orange     0   2
    
    [[2]]
       name age  fruit count tag
    1:    b  33  apple     1   2
    2:    b  33  mango     1   2
    3:    b  33  apple     0   1
    4:    b  33 cherry     0   1
    5:    b  33  mango     0   1
    6:    b  33 orange     0   1
    7:    b  33 cherry     0   2
    8:    b  33 orange     0   2
    
    [[3]]
       name age  fruit count tag
    1:    c  58  apple     1   1
    2:    c  58 cherry     1   1
    3:    c  58  mango     0   1
    4:    c  58 orange     0   1
    5:    c  58  apple     0   2
    6:    c  58 cherry     0   2
    7:    c  58  mango     0   2
    8:    c  58 orange     0   2
    

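Ordering by count and tag is not required, but it makes the result easier to compare with the OP's expected output.

Creating one large data.table

Instead of a list of data.frames with identical structure, we can use one large data.table in which the origin of each row is identified by an id column.

In fact, the OP has asked other questions ("using lapply function and list in r", "how to loop the dataframe using sqldf?") requesting help with handling a list of data.frames. G. Grothendieck has already suggested rbind-ing the rows together.

The rbindlist() function has an idcol parameter that identifies the origin of each row: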

    library(data.table)
    rbindlist(list, idcol = "df")
    
       df name age  fruit count tag
    1:  1    a  10 orange     1   1
    2:  1    a  10 cherry     1   1
    3:  1    a  10  apple     1   2
    4:  2    b  33  apple     1   2
    5:  2    b  33  mango     1   2
    6:  3    c  58 cherry     1   1
    7:  3    c  58  apple     1   1
    

Note that df contains the number of the source data.frame in list (or the names of the list elements if list is named).
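
As a small illustration of the named case, the sketch below assumes the data.table package is installed and uses a made-up two-element list (named_list, with element names first and second) rather than the OP's data.

```r
library(data.table)

# Hypothetical named list of two small data.frames
named_list <- list(
  first  = data.frame(fruit = "apple", count = 1,
                      stringsAsFactors = FALSE),
  second = data.frame(fruit = c("mango", "cherry"), count = c(1, 1),
                      stringsAsFactors = FALSE)
)

# With a named list, the idcol column holds the element names
# instead of the element positions
combined <- rbindlist(named_list, idcol = "df")

combined$df
```

Here the df column is a character vector of element names, which can be easier to read than positions when the list elements have meaningful names.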

Now we can apply the solution above by grouping by df:

    rbindlist(list, idcol = "df")[, .SD[
      CJ(name = name, age = age, fruit = default, tag = 1:2, unique = TRUE), 
      on = .(name, age, fruit, tag)], by = df][
        is.na(count), count := 0][order(df, -count, tag)]
    
        df name age  fruit count tag
     1:  1    a  10 cherry     1   1
     2:  1    a  10 orange     1   1
     3:  1    a  10  apple     1   2
     4:  1    a  10  apple     0   1
     5:  1    a  10  mango     0   1
     6:  1    a  10 cherry     0   2
     7:  1    a  10  mango     0   2
     8:  1    a  10 orange     0   2
     9:  2    b  33  apple     1   2
    10:  2    b  33  mango     1   2
    11:  2    b  33  apple     0   1
    12:  2    b  33 cherry     0   1
    13:  2    b  33  mango     0   1
    14:  2    b  33 orange     0   1
    15:  2    b  33 cherry     0   2
    16:  2    b  33 orange     0   2
    17:  3    c  58  apple     1   1
    18:  3    c  58 cherry     1   1
    19:  3    c  58  mango     0   1
    20:  3    c  58 orange     0   1
    21:  3    c  58  apple     0   2
    22:  3    c  58 cherry     0   2
    23:  3    c  58  mango     0   2
    24:  3    c  58 orange     0   2
        df name age  fruit count tag
    

Answer 2 (score: 1)

A solution using dplyr and tidyr. We can use complete() to expand each data frame and specify 0 as the fill value for count.

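Note that I changed the name of the list from list to fruit_list, because naming an object after a built-in function such as list() is bad practice in R. Also note that I set stringsAsFactors = FALSE when creating the example data frames, because I do not want factor columns. Finally, I use lapply() instead of a for loop to iterate over the list elements.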

library(dplyr)
library(tidyr)

fruit_list2 <- lapply(fruit_list, function(x){
  x2 <- x %>%
    complete(name, age, fruit = default, tag = c(1, 2), fill = list(count = 0)) %>%
    select(name, age, fruit, count, tag) %>%
    arrange(tag, fruit) %>%
    as.data.frame()
  return(x2)
})

fruit_list2
# [[1]]
#   name age  fruit count tag
# 1    a  10  apple     0   1
# 2    a  10 cherry     1   1
# 3    a  10  mango     0   1
# 4    a  10 orange     1   1
# 5    a  10  apple     1   2
# 6    a  10 cherry     0   2
# 7    a  10  mango     0   2
# 8    a  10 orange     0   2
# 
# [[2]]
#   name age  fruit count tag
# 1    b  33  apple     0   1
# 2    b  33 cherry     0   1
# 3    b  33  mango     0   1
# 4    b  33 orange     0   1
# 5    b  33  apple     1   2
# 6    b  33 cherry     0   2
# 7    b  33  mango     1   2
# 8    b  33 orange     0   2
# 
# [[3]]
#   name age  fruit count tag
# 1    c  58  apple     1   1
# 2    c  58 cherry     1   1
# 3    c  58  mango     0   1
# 4    c  58 orange     0   1
# 5    c  58  apple     0   2
# 6    c  58 cherry     0   2
# 7    c  58  mango     0   2
# 8    c  58 orange     0   2

Data

vector1 <-
  data.frame(
    "name" = "a",
    "age" = 10,
    "fruit" = c("orange", "cherry", "apple"),
    "count" = c(1, 1, 1),
    "tag" = c(1, 1, 2),
    stringsAsFactors = FALSE
  )
vector2 <-
  data.frame(
    "name" = "b",
    "age" = 33,
    "fruit" = c("apple", "mango"),
    "count" = c(1, 1),
    "tag" = c(2, 2),
    stringsAsFactors = FALSE
  )
vector3 <-
  data.frame(
    "name" = "c",
    "age" = 58,
    "fruit" = c("cherry", "apple"),
    "count" = c(1, 1),
    "tag" = c(1, 1),
    stringsAsFactors = FALSE
  )

fruit_list <- list(vector1, vector2, vector3)

default <- c("cherry", "orange", "apple", "mango")
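
The effect of the stringsAsFactors = FALSE setting used in the data above can be checked with a short sketch; the names as_factor and as_char are illustrative. Since R 4.0.0 the default is already FALSE, so the explicit argument mainly matters on older versions of R.

```r
# Same column built with and without conversion to factor
as_factor <- data.frame(fruit = c("apple", "mango"),
                        stringsAsFactors = TRUE)
as_char   <- data.frame(fruit = c("apple", "mango"),
                        stringsAsFactors = FALSE)

class(as_factor$fruit)  # "factor"
class(as_char$fruit)    # "character"
```

Character columns avoid surprises when new values such as a missing fruit are added, because a factor only accepts values that are already among its levels.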