Question

I have thousands of datasets like this one:

>student1
    quantities score
[1]          4    10         
[2]          1    12         
[3]         78     5         
[4]          6   294

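I want to compute the median score for this student. Each score comes with a quantity, i.e. the number of times that score occurs. In this case I want the result to be 5, because the median falls among the 78 fives.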

I have looked at some posts here, such as "how to calculate the median on grouped dataset?", but I could not use that approach because I have thousands of datasets.

I also tried installing the aroma.light and matrixStats packages, but I still cannot use the "weighted.median function" thing. It tells me:

Error: could not find function "weightedMedians"

OK, the above was just an example. My real dataset looks like this:

>test
     [,1]          [,2]
info    3            10
info    2            20
        4      86779637
        1        135777
        7          2342

But when I try

>rep(test[, 1], test[, 2])

I get:

Error in rep(test[, 1], test[, 2]) : invalid 'times' argument
In addition: Warning message:
NAs introduced by coercion

What can I do now?

Answer 1

You can use:

median(rep(student1$score, student1$quantities))

This is relatively fast (only a few seconds on a simulated dataset with 100k rows).
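As a quick check, here is the expansion approach applied to the student1 data from the question, followed by a minimal sketch of an alternative that avoids building the expanded vector. The helper name `weighted_median_cumsum` is hypothetical (it is not part of base R or any package), and it returns the lower weighted median, so it can differ from `median()` when the halfway point falls exactly between two distinct values.

```r
# Example data copied from the question
student1 <- data.frame(quantities = c(4, 1, 78, 6),
                       score      = c(10, 12, 5, 294))

# Expansion approach: repeat each score by its quantity, then take the median
expanded_median <- median(rep(student1$score, student1$quantities))

# Hypothetical helper (an assumption, not a library function):
# sort the scores, accumulate the quantities, and return the first score
# whose cumulative quantity reaches half of the total.
weighted_median_cumsum <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum <- cumsum(w)
  x[which(cum >= sum(w) / 2)[1]]
}

cumsum_median <- weighted_median_cumsum(student1$score, student1$quantities)

print(expanded_median)  # 5
print(cumsum_median)    # 5
```

The cumulative-sum version may be preferable when the quantities are very large, since `rep()` has to allocate one element per counted observation.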

Answer 2

The function in the matrixStats package that computes a weighted median is called weightedMedian() (without the plural 's'), for example:

> library("matrixStats")
matrixStats v0.14.0 (2015-02-13) successfully loaded. See ?matrixStats for help.
> weightedMedian(student1$score, w=student1$quantities)
[1] 5.670732
> weightedMedian(student1$score, w=student1$quantities, interpolate=FALSE)
[1] 5
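Since the question mentions thousands of datasets, one possible way to handle them is to keep them in a named list and apply the same computation to each element. The sketch below assumes every dataset is a data frame with `quantities` and `score` columns. The `datasets` list and the `student2` values are made up for illustration. It uses base R's `median(rep(...))` so that it runs without extra packages; with matrixStats installed, the inner call could likely be swapped for `weightedMedian(d$score, w = d$quantities, interpolate = FALSE)`.

```r
# Hypothetical list of datasets; student1 is from the question,
# student2 is invented purely for illustration.
datasets <- list(
  student1 = data.frame(quantities = c(4, 1, 78, 6),
                        score      = c(10, 12, 5, 294)),
  student2 = data.frame(quantities = c(2, 3, 1),
                        score      = c(7, 9, 20))
)

# Apply the same median computation to every dataset in the list.
# vapply() checks that each result is a single numeric value.
medians <- vapply(
  datasets,
  function(d) median(rep(d$score, d$quantities)),
  numeric(1)
)

print(medians)
```

The result is a named numeric vector with one median per dataset, which is easy to merge back into a summary table.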
