Question

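I have a dataset containing several groups, and I want to count the records in each group that meet a certain condition. I then want to extend the result to the remaining records in each group (i.e. those that do not meet the condition), because I will collapse the table later.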

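I am doing this with data.table, using .N to count the records in each group that meet the condition. I then take the maximum of all values in each group so that the result applies to every record in the group. My dataset is quite large (almost 5 million records).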

I keep getting the following error:

  Error in `[.data.table`(dpart, , `:=`(clustersize4wk, max(clustersize4wk,  : 
  Type of RHS ('double') must match LHS ('integer'). To check and coerce would impact performance too much for the fastest cases. Either change the type of the target column, or coerce the RHS of := yourself (e.g. by using 1L instead of 1)

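At first I assumed that .N produces an integer and that taking the maximum by group produces a double, but that does not seem to be the case (in the toy example below, the class of the result column stays integer throughout), and I cannot reproduce the problem.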

To illustrate, here is an example:

library(data.table)

# Example data:

mydt <- data.table(id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
                   grp = c("G1", "G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2", "G2", "G2"),
                   name = c("Jack", "John", "Jill", "Joe", "Jim", "Julia", "Simran", "Delia", "Aurora", "Daniele", "Joan", "Mary"),
                   sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f"), 
                   age = c(2,12,29,15,30,75,5,4,7,55,43,39), 
                   reportweek = c("201740", "201750", "201801", "201801", "201801", "201748", "201748", "201749", "201750", "201752", "201752", "201801"))

I am counting the number of males in each group:

mydt[sex == "m", csize := .N, by = id]

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

Some groups contain no males, so to avoid getting Inf in the next step, I recode NA as 0:

mydt[ is.na(csize), csize := 0]

I then extend the result to all members of each group as follows:

mydt[, csize := max(csize, na.rm = T), by = id] 

> is.integer(mydt$csize)
[1] TRUE
> is.double(mydt$csize)
[1] FALSE

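This is the point at which the error occurs with my real dataset. If I omit the step that recodes NAs as 0, I can reproduce the error with the example data; otherwise I cannot. In addition, with my real dataset (even though the NAs have been recoded as 0), I still get the following warning: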

19: In max(clustersize4wk, na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

How can I fix this?

My expected output is below:

> mydt
    id grp    name sex age reportweek csize
 1:  a  G1    Jack   m   2     201740     2
 2:  a  G1    John   m  12     201750     2
 3:  b  G1    Jill   f  29     201801     2
 4:  b  G1     Joe   m  15     201801     2
 5:  b  G1     Jim   m  30     201801     2
 6:  c  G2   Julia   f  75     201748     1
 7:  c  G2  Simran   m   5     201748     1
 8:  c  G2   Delia   f   4     201749     1
 9:  c  G2  Aurora   f   7     201750     1
10:  d  G2 Daniele   f  55     201752     0
11:  d  G2    Joan   f  43     201752     0
12:  d  G2    Mary   f  39     201801     0

Answer 1

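The actual problem is the data type of csize. It is of type integer, but max can return a double: when every value in a group is NA, max(csize, na.rm = TRUE) has no non-missing arguments and returns -Inf, which is a double.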
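The type behaviour of max can be checked in base R without data.table. The following is a minimal sketch; the vectors x and all_na are made up for illustration and are not part of the question's data.

```r
# max() keeps the integer type as long as one non-missing value exists
x <- c(2L, NA, 1L)
typeof(max(x, na.rm = TRUE))       # "integer"

# With only NAs, nothing is left after na.rm = TRUE:
# max() warns and returns -Inf, which is a double
all_na <- c(NA_integer_, NA_integer_)
res <- suppressWarnings(max(all_na, na.rm = TRUE))
typeof(res)                        # "double"

# Mixing in a literal 0 (a double) also yields a double ...
typeof(max(x, 0, na.rm = TRUE))    # "double"
# ... whereas 0L keeps the result an integer, even for the all-NA case
typeof(max(all_na, 0L, na.rm = TRUE))  # "integer"
```

So a single group whose values are all NA is enough for the right-hand side of := to be a double while the target column is an integer.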

A fix could be:

mydt[sex == "m", csize := as.double(.N), by = id]

mydt[, csize := max(csize, 0, na.rm = TRUE), by = id]
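As an alternative that is not part of the answer above, the count can be computed for all rows of a group in one step, which avoids NA, -Inf and the type mismatch altogether. This is a sketch on a reduced copy of the example data (only id and sex are kept), and it assumes the condition can be written as a logical expression inside the grouped call.

```r
library(data.table)

# Reduced copy of the example data: only the columns needed here
mydt <- data.table(
  id  = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  sex = c("m", "m", "f", "m", "m", "f", "m", "f", "f", "f", "f", "f")
)

# sum() over a logical vector returns an integer, and a group
# without males simply gets 0L, so no recoding step is needed
mydt[, csize := sum(sex == "m", na.rm = TRUE), by = id]

typeof(mydt$csize)            # column type stays integer
unique(mydt[, .(id, csize)])  # one row per id with its male count
```

Because the value is assigned to every row of the group at once, the separate steps for recoding NA and taking the maximum are no longer needed.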
