findInterval not working with an R data.frame and a lookup table

Date: 2016-01-28 17:54:34

Tags: r

I am trying to use findInterval to find which quartile (1st, 2nd, 3rd, 4th) each number in a list falls into.

I have a lookup matrix:

> lookup
               0%      25%      50%      75%     100%
apple    3.846154 13.88889 18.11594 22.96296 47.22222
banana   5.882353 16.03694 20.53429 25.58937 47.82609
cucumber 6.060606 15.38462 18.75000 23.06815 39.47368
doritos  4.347826 14.43110 17.67830 22.81101 38.70968
elephant 7.582938 16.01732 18.71921 23.23232 36.28692
frog     2.439024 14.55696 18.70504 22.52252 36.14458
gorilla  3.448276 15.49895 19.59184 23.21852 34.78261
hangover 3.750000 10.71378 15.09434 18.09857 34.61538

and a data.frame:

> DF
Source: local data table [1,426 x 2]

        cat      rate
     (fctr)     (dbl)
1   doritos  9.803922
2  hangover 22.968198
3    banana 12.658228
4  cucumber 12.643678
5  elephant 11.299435
6   gorilla 15.481172
7     apple 23.163842
8      frog 38.461538
9   doritos 14.563107
10 hangover 14.634146
..      ...       ...

But when I run:

DF$level = findInterval(DF$rate, lookup[as.character(DF$cat), ], rightmost.closed = TRUE)

I get this error:

Error in findInterval(DF$rate, lookup[as.character(DF$cat), ], rightmost.closed = TRUE) :
  'vec' must be sorted non-decreasingly and not contain NAs

even though the matrix is sorted. I can add sort() like this:

DF$level = findInterval(DF$rate, sort(lookup[as.character(DF$cat), ]), rightmost.closed = TRUE)

but then I get strange numbers:

> DF
Source: local data table [1,426 x 3]

        cat      rate level
     (fctr)     (dbl) (int)
1   doritos  9.803922  1426
2  hangover 22.968198  4992
3    banana 12.658228  1605
4  cucumber 12.643678  1605
5  elephant 11.299435  1605
6   gorilla 15.481172  2497
7     apple 23.163842  5170
8      frog 38.461538  6417
9   doritos 14.563107  2140
10 hangover 14.634146  2140

If I run the command on a single row of the data.frame, it seems to work with or without sort:

> findInterval(DF$rate[1], sort(lookup[as.character(DF$cat[1]), ]), rightmost.closed = TRUE)
[1] 1
> findInterval(DF$rate[2], lookup[as.character(DF$cat[2]), ], rightmost.closed = TRUE)
[1] 4

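I managed a workaround using percent_rank and then classifying each row by level, but I would still like to know why this doesn't work. I think I'm missing something about vectorization.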

1 Answer:

Answer 0: (score: 2)

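Each category and its rate have to be passed to the function together, one pair at a time. findInterval is vectorized over its first argument but accepts only a single vector of breakpoints, so lookup[as.character(DF$cat), ] (a matrix with one row per row of DF) is treated as one long, unsorted vector rather than as a separate set of breakpoints for each row. The function mapply lets us make the call row by row: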

DF$level <- mapply(function(x,y) {
  findInterval(x, lookup[as.character(y), ], rightmost.closed = TRUE)},
  DF$rate, DF$cat
)
DF
#         cat      rate level
# 1   doritos  9.803922     1
# 2  hangover 22.968198     4
# 3    banana 12.658228     1
# 4  cucumber 12.643678     1
# 5  elephant 11.299435     1
# 6   gorilla 15.481172     1
# 7     apple 23.163842     4
# 8      frog 38.461538     5
# 9   doritos 14.563107     2
# 10 hangover 14.634146     2
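
To see why the original one-shot call misbehaves, here is a minimal sketch. It rebuilds only two rows of the lookup table and the first two rows of DF from the question as small stand-ins; the name breaks is introduced purely for illustration.

```r
# Two rows of the question's lookup table, rebuilt here for illustration.
lookup <- rbind(
  doritos  = c(4.347826, 14.43110, 17.67830, 22.81101, 38.70968),
  hangover = c(3.750000, 10.71378, 15.09434, 18.09857, 34.61538)
)
colnames(lookup) <- c("0%", "25%", "50%", "75%", "100%")

# A small stand-in for DF (first two rows of the question's data).
DF <- data.frame(cat  = factor(c("doritos", "hangover")),
                 rate = c(9.803922, 22.968198))

# Indexing by several row names returns a matrix: one row per element of DF$cat.
breaks <- lookup[as.character(DF$cat), ]
dim(breaks)

# Flattened column by column, the breakpoints of the two categories are
# interleaved, so the vector is not sorted (hence the error message).
is.unsorted(as.vector(breaks))

# After sort(), every rate is compared against the pooled breakpoints of
# all rows, so the result is a position in that pool, not a quartile.
pooled <- findInterval(DF$rate, sort(breaks), rightmost.closed = TRUE)
pooled
```

With the full 1,426-row DF the pooled vector has 1,426 x 5 = 7,130 entries, which is consistent with the four-digit "levels" shown in the question.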

The same thing with dplyr:

DF %>% rowwise() %>% mutate(level=findInterval(rate, lookup[as.character(cat),],
                                 rightmost.closed=TRUE))
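
If the row-by-row call turns out to be slow on a large table, one possible alternative (not part of the original answer) is to compare each rate against its own row of breakpoints with plain matrix arithmetic. The sketch below assumes every category in DF appears as a row name of lookup and that there are no NAs; the small DF and lookup are stand-ins rebuilt from the question, and breaks, level, n and reference are illustrative names.

```r
# Stand-ins rebuilt from the question's data.
lookup <- rbind(
  doritos  = c(4.347826, 14.43110, 17.67830, 22.81101, 38.70968),
  hangover = c(3.750000, 10.71378, 15.09434, 18.09857, 34.61538),
  frog     = c(2.439024, 14.55696, 18.70504, 22.52252, 36.14458)
)
DF <- data.frame(
  cat  = factor(c("doritos", "hangover", "frog", "doritos", "hangover")),
  rate = c(9.803922, 22.968198, 38.461538, 14.563107, 14.634146)
)

# One row of breakpoints per row of DF (drop = FALSE keeps a matrix
# even when DF has a single row).
breaks <- lookup[as.character(DF$cat), , drop = FALSE]

# Count how many breakpoints each rate has reached. DF$rate is recycled
# down the columns, so row i is compared only with its own breakpoints.
level <- rowSums(DF$rate >= breaks)

# Mimic rightmost.closed = TRUE: a rate exactly equal to the largest
# breakpoint belongs to the last interval, not one past it.
n <- ncol(breaks)
level[DF$rate == breaks[, n]] <- n - 1

# Reference result from the row-by-row approach, for comparison.
reference <- mapply(function(x, y) {
  findInterval(x, lookup[y, ], rightmost.closed = TRUE)
}, DF$rate, as.character(DF$cat))

all(level == reference)
```

As with findInterval, a rate above the 100% breakpoint still comes out as 5 and a rate below the 0% breakpoint as 0, so such values may need separate handling if only levels 1 to 4 are wanted.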