R data.table: compare row value to group values, with condition

Time: 2015-10-26 10:27:10

Tags: r data.table

This is a follow-up to the question:

R data table: compare row value to group values

I now have:

x = data.table(id = c(1, 1, 1, 1, 1, 1, 1, 1),
               price = c(10, 10, 12, 12, 12, 15, 8, 11),
               subgroup = c(1, 1, 1, 1, 1, 1, 2, 2))

   id price subgroup
1:  1    10        1
2:  1    10        1
3:  1    12        1
4:  1    12        1
5:  1    12        1
6:  1    15        1
7:  1     8        2
8:  1    11        2

and would like to count, for each id, the number of rows with a lower price, but counting only rows that belong to subgroup 1.

If I use:

x[,cheaper := rank(price, ties.method="min")-1, by=id]

the result is:

> x
   id price subgroup cheaper
1:  1    10        1       1   # only 1 is cheaper (row 7)
2:  1    10        1       1   # only 1 is cheaper (row 7)
3:  1    12        1       4   # 4 rows are cheaper (rows 1, 2, 7, 8)
4:  1    12        1       4   # etc
5:  1    12        1       4
6:  1    15        1       7
7:  1     8        2       0
8:  1    11        2       3

But I would like the result to be:

> x
   id price subgroup cheaper_in_subgroup_1
1:  1    10        1       0    # nobody in subgroup 1 is cheaper
2:  1    10        1       0    # nobody in subgroup 1 is cheaper
3:  1    12        1       2    # only row 1 and 2 are cheaper in subgroup 1
4:  1    12        1       2
5:  1    12        1       2
6:  1    15        1       5
7:  1     8        2       0    # nobody in subgroup 1 is cheaper
8:  1    11        2       2    # only row 1 and 2 are cheaper in subgroup 1

2 answers:

Answer 0: (score: 2)

There may be more data.table-idiomatic ways to achieve this, but here is an attempt using vapply within each id:

x[, cheaper := vapply(price, 
                      function(x) sum(price[subgroup == 1L] < x),
                      FUN.VALUE = integer(1L)), 
               by = id]
x
#    id price subgroup cheaper
# 1:  1    10        1       0
# 2:  1    10        1       0
# 3:  1    12        1       2
# 4:  1    12        1       2
# 5:  1    12        1       2
# 6:  1    15        1       5
# 7:  1     8        2       0
# 8:  1    11        2       2
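
The same per-id count can also be obtained without calling a function once per row. The following is only a sketch, not part of the answer above: it assumes R >= 3.3.0 (for the left.open argument of findInterval) and swaps vapply for findInterval on the sorted subgroup-1 prices, which returns how many of those prices are strictly below each row's price.

```r
library(data.table)

x <- data.table(id = c(1, 1, 1, 1, 1, 1, 1, 1),
                price = c(10, 10, 12, 12, 12, 15, 8, 11),
                subgroup = c(1, 1, 1, 1, 1, 1, 2, 2))

# Within each id, sort the subgroup-1 prices once. With left.open = TRUE,
# findInterval() uses intervals of the form (v[i], v[i+1]], so the returned
# index equals the number of sorted prices strictly below each row's price.
x[, cheaper := findInterval(price, sort(price[subgroup == 1]), left.open = TRUE),
  by = id]
x
```

Because it only compares values, this variant does not depend on price being an integer.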

Answer 1: (score: 2)

Here is another way, using a small trick with rolling joins:

y = x[subgroup==1L, .N, keyby=.(id, price+1L)][, N := cumsum(N), by=id][]
#    id price N
# 1:  1    11 2
# 2:  1    13 5
# 3:  1    16 6
x[, cheaper := y[x, N, roll=TRUE, rollends=FALSE, on=c("id", "price")]]
#    id price subgroup cheaper
# 1:  1    10        1      NA
# 2:  1    10        1      NA
# 3:  1    12        1       2
# 4:  1    12        1       2
# 5:  1    12        1       2
# 6:  1    15        1       5
# 7:  1     8        2      NA
# 8:  1    11        2       2

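The idea is to take the cumulative count for each id, price, but store it under price+1L. As a result, when the rolling join is performed, each value in x picks up the count from the last observation that precedes it.

PS: If price is not of integer type, it would be price * (1 + eps) instead of price + 1L when building y.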
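
The rolling join leaves NA in the rows whose price lies below every key in y, whereas the question asks for 0 there. A minimal sketch of closing that gap follows. It assumes that every NA comes from a price below the first key in y, which holds for the sample data; it also takes the cumulative sum per id, and the intermediate name counts is introduced here only for illustration.

```r
library(data.table)

x <- data.table(id = c(1, 1, 1, 1, 1, 1, 1, 1),
                price = c(10, 10, 12, 12, 12, 15, 8, 11),
                subgroup = c(1, 1, 1, 1, 1, 1, 2, 2))

# Cumulative subgroup-1 counts per id, stored under price + 1
y <- x[subgroup == 1L, .N, keyby = .(id, price + 1L)][, N := cumsum(N), by = id]

# Rolling join as in the answer: each row of x takes the count of the
# nearest key in y at or below its own price (NA if there is none)
counts <- y[x, N, roll = TRUE, rollends = FALSE, on = c("id", "price")]
x[, cheaper := counts]

# Assumption: an NA here means no subgroup-1 row is cheaper, so use 0
x[is.na(cheaper), cheaper := 0L]
x
```

On the sample data this gives the cheaper_in_subgroup_1 column requested in the question.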