I am trying to use findInterval to find which quartile (1st, 2nd, 3rd, 4th) each number in a list belongs to.
I have a lookup matrix:
> lookup
0% 25% 50% 75% 100%
apple 3.846154 13.88889 18.11594 22.96296 47.22222
banana 5.882353 16.03694 20.53429 25.58937 47.82609
cucumber 6.060606 15.38462 18.75000 23.06815 39.47368
doritos 4.347826 14.43110 17.67830 22.81101 38.70968
elephant 7.582938 16.01732 18.71921 23.23232 36.28692
frog 2.439024 14.55696 18.70504 22.52252 36.14458
gorilla 3.448276 15.49895 19.59184 23.21852 34.78261
hangover 3.750000 10.71378 15.09434 18.09857 34.61538
and a data.frame:
> DF
Source: local data table [1,426 x 2]
cat rate
(fctr) (dbl)
1 doritos 9.803922
2 hangover 22.968198
3 banana 12.658228
4 cucumber 12.643678
5 elephant 11.299435
6 gorilla 15.481172
7 apple 23.163842
8 frog 38.461538
9 doritos 14.563107
10 hangover 14.634146
.. ... ...
But when I run:
DF$level = findInterval(DF$rate, lookup[as.character(DF$cat), ], rightmost.closed = TRUE)
I get this error:
Error in findInterval(DF$rate, lookup[as.character(DF$cat), ], rightmost.closed = TRUE) :
'vec' must be sorted non-decreasingly and not contain NAs
even though each row of the matrix is sorted. I can add sort() like this:
DF$level = findInterval(DF$rate, sort(lookup[as.character(DF$cat), ]), rightmost.closed = TRUE)
but then I get strange numbers:
> DF
Source: local data table [1,426 x 3]
cat rate level
(fctr) (dbl) (int)
1 doritos 9.803922 1426
2 hangover 22.968198 4992
3 banana 12.658228 1605
4 cucumber 12.643678 1605
5 elephant 11.299435 1605
6 gorilla 15.481172 2497
7 apple 23.163842 5170
8 frog 38.461538 6417
9 doritos 14.563107 2140
10 hangover 14.634146 2140
If I run the command on a single row of the data.frame, it seems to work with or without sort:
> findInterval(DF$rate[1], sort(lookup[as.character(DF$cat[1]), ]), rightmost.closed = TRUE)
[1] 1
> findInterval(DF$rate[2], lookup[as.character(DF$cat[2]), ], rightmost.closed = TRUE)
[1] 4
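I managed a workaround using percent_rank and then classifying each row by level, but I would still like to know why this doesn't work. I think I am missing something about vectorization.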
Answer 0: (score: 2)
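The category and the rate have to be passed to the function one pair at a time. findInterval is vectorized over its first argument only; its second argument must be a single sorted vector of breakpoints. In the original call, lookup[as.character(DF$cat), ] is a 1,426 x 5 matrix, which is treated as one vector of 7,130 values. That vector is not sorted, hence the error, and after sort() the result is a position among all 7,130 pooled values rather than a quartile. The function mapply lets us pair each rate with its own row of breakpoints: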
DF$level <- mapply(function(x,y) {
findInterval(x, lookup[as.character(y), ], rightmost.closed = TRUE)},
DF$rate, DF$cat
)
DF
# cat rate level
# 1 doritos 9.803922 1
# 2 hangover 22.968198 4
# 3 banana 12.658228 1
# 4 cucumber 12.643678 1
# 5 elephant 11.299435 1
# 6 gorilla 15.481172 1
# 7 apple 23.163842 4
# 8 frog 38.461538 5
# 9 doritos 14.563107 2
# 10 hangover 14.634146 2
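To see why the original one-shot call misbehaves, it can help to inspect what the matrix subset looks like. The following is a minimal sketch on a made-up two-row lookup (the values and the `cats` vector are illustrative assumptions, not the data from the question):

```r
# Hypothetical two-row lookup mirroring the structure of the matrix in the question
lookup <- rbind(
  apple  = c(3.8, 13.9, 18.1, 23.0, 47.2),
  banana = c(5.9, 16.0, 20.5, 25.6, 47.8)
)
colnames(lookup) <- c("0%", "25%", "50%", "75%", "100%")

cats <- c("banana", "apple", "banana")

# Indexing with a vector of row names returns a matrix, one row per element
breaks <- lookup[cats, ]
dim(breaks)                               # 3 5

# Flattened in column-major order, the 15 values are no longer sorted
unsorted <- is.unsorted(as.vector(breaks))
unsorted                                  # TRUE

# After sort(), the results are positions among all 15 pooled values,
# not positions within any single row's 5 breakpoints
pooled <- findInterval(c(12, 30, 50), sort(breaks))
pooled                                    # 3 12 15
```

With 1,426 rows the pooled vector has 7,130 entries, which is consistent with the four-digit "levels" shown in the question.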
Or with dplyr:
DF %>% rowwise() %>% mutate(level=findInterval(rate, lookup[as.character(cat),],
rightmost.closed=TRUE))
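If the row-by-row calls turn out to be slow, one possible alternative is to skip findInterval and count, for each observation, how many of its own breakpoints lie at or below the rate. This is only a sketch under assumptions: it uses a two-category subset of the lookup values from the question, assumes there are no NAs, and imitates rightmost.closed = TRUE by hand rather than relying on any built-in option.

```r
# Subset of the lookup matrix and data from the question, for illustration only
lookup <- rbind(
  doritos  = c(4.347826, 14.43110, 17.67830, 22.81101, 38.70968),
  hangover = c(3.750000, 10.71378, 15.09434, 18.09857, 34.61538)
)
DF <- data.frame(
  cat  = factor(c("doritos", "hangover", "doritos", "hangover")),
  rate = c(9.803922, 22.968198, 14.563107, 14.634146)
)

# One row of breakpoints per observation (n x 5 matrix)
breaks <- lookup[as.character(DF$cat), , drop = FALSE]

# Count the breakpoints at or below each rate. DF$rate is recycled down the
# columns, so every entry in row i is compared with rate i.
level <- rowSums(DF$rate >= breaks)

# Imitate rightmost.closed = TRUE: a rate equal to the row maximum
# stays in the last interval instead of getting its own level
n_breaks <- ncol(breaks)
level[DF$rate == breaks[, n_breaks]] <- n_breaks - 1

DF$level <- as.integer(level)
DF$level                                  # 1 4 2 2
```

As with findInterval, a rate below the row minimum gets 0 and a rate above the row maximum gets 5 here, so the result should agree with the mapply version on data without missing values.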