Question

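I have a data frame like the one below. Note that on day b there is no entry for product 1. This means product 1 was not sold that day, but I want a row that makes this explicit. That is, I want to add a row with day = b, product = 1 and sales = 0. I want to do this for every day-product pair that is missing from the data frame, e.g. day c and product 3. How can I do this?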

  df <- data.frame(day=c(rep('a',3), rep('b',2), rep('c',2)), 
                   product = c(1:3, 2:3,1:2), 
                   sales = runif(7))

Answer 1

Option 1

Thanks to @Frank for the better solution, using tidyr:

library(tidyr)
complete(df, day, product, fill = list(sales = 0))

With this approach, you no longer need to worry about selecting the product names, etc.

This gives you (row order may differ):

  day product      sales
1   a       1 0.52042809
2   b       1 0.00000000
3   c       1 0.46373882
4   a       2 0.11155348
5   b       2 0.04937618
6   c       2 0.26433153
7   a       3 0.69100939
8   b       3 0.90596172
9   c       3 0.00000000
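
For intuition, what complete does here can be approximated in base R: build every day/product combination, join the observed sales onto it, and replace the resulting NA values with 0. This is only a sketch of the idea, not tidyr's actual implementation, and the fixed sales values are made up so the result is reproducible.

```r
# Small example data with fixed, made-up sales values (the question uses runif()).
df <- data.frame(day = c(rep("a", 3), rep("b", 2), rep("c", 2)),
                 product = c(1:3, 2:3, 1:2),
                 sales = c(0.52, 0.11, 0.69, 0.05, 0.91, 0.46, 0.26),
                 stringsAsFactors = FALSE)

# Step 1: every combination of the observed days and products.
all_pairs <- expand.grid(day = unique(df$day),
                         product = unique(df$product),
                         stringsAsFactors = FALSE)

# Step 2: left join the observed sales onto the full grid.
filled <- merge(all_pairs, df, by = c("day", "product"), all.x = TRUE)

# Step 3: combinations without a match come back as NA; set those to 0.
filled$sales[is.na(filled$sales)] <- 0
filled
```

The rows for day b / product 1 and day c / product 3 are the two that end up with sales = 0.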

Option 2

You can also do this with the tidyr package (and dplyr):

df %>% 
  spread(product, sales, fill = 0) %>% 
  gather(`1`:`3`, key = "product", value = "sales")

This gives the same result.

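This works by using spread to create a wide data frame in which each product is its own column. The argument fill = 0 fills all empty cells with 0 (the default is NA).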

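Next, gather converts the 'wide' data frame back into the original 'long' data frame. The first argument selects the product columns (in this case `1`:`3`). We then set key and value to the original column names.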
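
In tidyr 1.0.0 and later, spread and gather are superseded by pivot_wider and pivot_longer. The same wide-then-long round trip might look like the sketch below. It is not part of the original answer, and the fixed sales values are made up for reproducibility.

```r
library(tidyr)

# Small example data with fixed, made-up sales values.
df <- data.frame(day = c(rep("a", 3), rep("b", 2), rep("c", 2)),
                 product = c(1:3, 2:3, 1:2),
                 sales = c(0.52, 0.11, 0.69, 0.05, 0.91, 0.46, 0.26),
                 stringsAsFactors = FALSE)

# Wide: one column per product; missing day/product cells are filled with 0.
wide <- pivot_wider(df, names_from = product, values_from = sales,
                    values_fill = list(sales = 0))

# Long again: every column except day is a product column.
long <- pivot_longer(wide, cols = -day,
                     names_to = "product", values_to = "sales")

# Column names are character, so convert product back to integer.
long$product <- as.integer(long$product)
long
```

Selecting the product columns with cols = -day avoids hard-coding the product names.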

I recommend option 1, but option 2 may still be useful in some cases.

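Both options work for every day on which you recorded at least one sale. If entire days are missing, I suggest looking at the padr package first, and then using tidyr as above to do the rest.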
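
If the full set of days is already known, an alternative to padr is to pass it to complete explicitly. The sketch below assumes a hypothetical extra day "d" with no recorded sales at all; the vector all_days and the fixed sales values are illustrative, not from the original data.

```r
library(tidyr)

# Small example data with fixed, made-up sales values.
df <- data.frame(day = c(rep("a", 3), rep("b", 2), rep("c", 2)),
                 product = c(1:3, 2:3, 1:2),
                 sales = c(0.52, 0.11, 0.69, 0.05, 0.91, 0.46, 0.26),
                 stringsAsFactors = FALSE)

# Hypothetical: day "d" is a valid day on which nothing was sold.
all_days <- c("a", "b", "c", "d")

# Supplying `day = all_days` expands over the given days instead of
# only the days that happen to appear in df.
filled <- complete(df, day = all_days, product, fill = list(sales = 0))
filled
```

This yields 4 days times 3 products = 12 rows, with all three rows for day "d" set to 0.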

Answer 2

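If speed is a concern, a join against the cross join of all days and products can be used to fill in the missing levels (see the data.table documentation for CJ):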

library(data.table)
setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
  is.na(sales), sales := 0.0][]

   day product      sales
1:   a       1 0.57406950
2:   a       2 0.04390324
3:   a       3 0.63809278
4:   b       1 0.00000000
5:   b       2 0.01203568
6:   b       3 0.61310815
7:   c       1 0.19049274
8:   c       2 0.61758172
9:   c       3 0.00000000
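
The one-liner above chains three steps. Spelled out separately, under the assumption of a small table with made-up sales values, they might look like this:

```r
library(data.table)

# Small example data with fixed, made-up sales values.
dt <- data.table(day = c(rep("a", 3), rep("b", 2), rep("c", 2)),
                 product = c(1:3, 2:3, 1:2),
                 sales = c(0.52, 0.11, 0.69, 0.05, 0.91, 0.46, 0.26))

# Step 1: cross join of the unique days and products (3 x 3 = 9 rows).
grid <- CJ(day = dt$day, product = dt$product, unique = TRUE)

# Step 2: look up each grid row in dt; rows without a match get sales = NA.
filled <- dt[grid, on = .(day, product)]

# Step 3: replace the NA values with 0 by reference.
filled[is.na(sales), sales := 0]
filled
```

Note that setDT(df) in the original code converts df to a data.table by reference, so df itself is modified; the sketch sidesteps this by creating a data.table directly.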

Benchmark

Create benchmark data of 1 million rows minus 10% missing = 0.9 M rows:

n_day <-  1e3L
n_prod <- 1e3L
n_rows <- n_day * n_prod
# how many rows to remove?
n_miss <- n_rows / 10L
set.seed(1L)
df <- expand.grid(day = 1:n_day, product = 1:n_prod)
df$sales <- runif(n_rows)
#remove rows
df <- df[-sample.int(n_rows, n_miss), ]
str(df)

'data.frame': 900000 obs. of  3 variables:
 $ day    : int  1 2 3 5 6 7 8 9 11 12 ...
 $ product: int  1 1 1 1 1 1 1 1 1 1 ...
 $ sales  : num  0.266 0.372 0.573 0.202 0.898 ...
 - attr(*, "out.attrs")=List of 2
  ..$ dim     : Named int  1000 1000
  .. ..- attr(*, "names")= chr  "day" "product"
  ..$ dimnames:List of 2
  .. ..$ day    : chr  "day=   1" "day=   2" "day=   3" "day=   4" ...
  .. ..$ product: chr  "product=   1" "product=   2" "product=   3" "product=   4" ...

Define a check function:

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(as.data.frame(values[[1]]), as.data.frame(x))))
}
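
As far as I understand the check argument of microbenchmark, the function is called with a list holding the result of each benchmarked expression. A quick way to see how my_check behaves is to call it by hand on small data frames; the inputs below are made up for illustration.

```r
my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(as.data.frame(values[[1]]), as.data.frame(x))))
}

# Two identical results and one with the rows in a different order.
a <- data.frame(day = c("a", "b"), sales = c(1, 0))
b <- a
reordered <- a[2:1, ]

my_check(list(a, b))          # TRUE: both results are identical
my_check(list(a, reordered))  # FALSE: same rows, different order
```

Because identical is strict, the compared results must agree in row order and column types, not just in content.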

Run the benchmark:

library(data.table)
microbenchmark::microbenchmark(
  tidyr = tidyr::complete(df, day, product, fill = list(sales = 0)),
  dt = setDT(df)[CJ(day = day, product = product, unique = TRUE), on = .(day, product)][
    is.na(sales), sales := 0.0][],
  times = 3L,
  check = my_check
)

Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval cld
 tidyr 1253.3395 1258.0595 1323.5438 1262.7794 1358.6459 1454.5124     3   b
    dt   94.4451  100.2952  155.4575  106.1452  185.9638  265.7823     3  a

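For the given problem size of 1 M rows minus 10%, the tidyr solution is about 12 times slower than the data.table approach in median time (roughly 1263 ms vs. 106 ms).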

Inserting rows into a data frame when values are missing in a category