My dataset is as follows:
salary number
1500-1600 110
1600-1700 180
1700-1800 320
1800-1900 460
1900-2000 850
2000-2100 250
2100-2200 130
2200-2300 70
2300-2400 20
2400-2500 10
How do I calculate the median of this dataset? Here is what I have tried:
x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]",
"(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
"(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))
numbers cumsum
[1500-1600] 110 110
(1600-1700] 180 290
(1700-1800] 320 610
(1800-1900] 460 1070
(1900-2000] 850 1920
(2000,2100] 250 2170
(2100-2200] 130 2300
(2200-2300] 70 2370
(2300-2400] 20 2390
(2400-2500] 10 2400
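Here you can see that the halfway frequency is 2400/2 = 1200. It lies between the cumulative frequencies 1070 and 1920, so the median class is the (1900-2000] group. You can use the formula below to get this result: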
Median = L + (h / f) * (n / 2 - c)
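where:
L is the lower class boundary of the median class
h is the size of the median class, i.e. the difference between its upper and lower class boundaries
f is the frequency of the median class
c is the cumulative frequency of the classes before the median class
n / 2 is the total number of observations divided by 2 (i.e. sum f / 2)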
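As a worked check, plugging in the values read off the cumulative table above (L = 1900, h = 100, f = 850, c = 1070, n = 2400), and reading the formula as h/f multiplied by (n/2 - c), gives:

```latex
\[
\text{Median} = L + \frac{h}{f}\left(\frac{n}{2} - c\right)
= 1900 + \frac{100}{850}\,(1200 - 1070)
= 1900 + \frac{13000}{850}
\approx 1915.294
\]
```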
Alternatively, the median class is found as follows:
Locate n / 2 in the cumulative frequency column.
Take the class in which it lies.
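A minimal sketch of these two steps in base R, assuming the classes are listed in ascending order. The variable names (freq, lower, cf, half, median_class) are illustrative only and do not come from the original post:

```r
# Frequencies of the ten salary classes, in ascending order
freq <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
lower <- seq(1500, 2400, by = 100)  # lower class boundaries

cf <- cumsum(freq)      # cumulative frequencies
half <- sum(freq) / 2   # n / 2

# The median class is the first class whose cumulative
# frequency reaches n / 2
median_class <- which(cf >= half)[1]
lower[median_class]     # lower boundary of the median class
```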
In code:
> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)
[1] 1915.294
What I would like now is to make the expression above, i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900), more elegant. How can I do that?
Answer 0 (score: 6)
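Since you already know the formula, it should be easy enough to write a function that does the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments: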
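frequencies: a vector of frequencies ("numbers" in your first example).
intervals: a 2-row matrix with the same number of columns as the length of frequencies, where the first row holds the lower class boundaries and the second row holds the upper class boundaries. Alternatively, intervals may be a column in your data.frame, and you may specify sep (and possibly trim) to have the function create the required matrix for you.
sep: the separator character used in the intervals column of your data.frame.
trim: a regular expression matching the characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression to remove (, ), [ and ] from the input.
Here's the function (the comments show how I used your instructions to put it together):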
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
# If "sep" is specified, the function will try to create the
# required "intervals" matrix. "trim" removes any unwanted
# characters before attempting to convert the ranges to numeric.
if (!is.null(sep)) {
if (is.null(trim)) pattern <- ""
else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
else pattern <- trim
intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
}
cf <- cumsum(frequencies)
n_2 <- max(cf)/2 # total observations divided by 2
Midrow <- findInterval(n_2, cf) + 1 # index of the median class
L <- intervals[1, Midrow] # lower class boundary of median class
h <- diff(intervals[, Midrow]) # size of median class
f <- frequencies[Midrow] # frequency of median class
# cumulative frequency of the classes before the median class (0 if it is the first class)
cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to use:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made-up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
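By the way, with the sample data you provided, I think there is a mistake in one of your ranges (all are separated by dashes except one, which is separated by a comma). Since strsplit splits on a regular expression by default, you can still use the function like this: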
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
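To make the sep and trim handling more concrete, here is a small sketch of the parsing step on its own, using the same calls that appear inside GroupedMedian. The three labels are a subset of the question's row names, chosen for illustration:

```r
# A few interval labels in the question's style, one of them using a comma
labels <- c("[1500-1600]", "(1900-2000]", "(2000,2100]")

# Same pattern that trim = "cut" selects: strip brackets and parentheses
pattern <- "\\[|\\]|\\(|\\)"
cleaned <- gsub(pattern, "", labels)

# Split on either a dash or a comma, then convert each pair to numeric.
# sapply() returns a 2-row matrix: row 1 = lower bounds, row 2 = upper bounds.
intervals <- sapply(strsplit(cleaned, "-|,"), as.numeric)
intervals
```

Each column of the resulting matrix describes one class, which is the layout the rest of the function indexes with intervals[1, Midrow] and intervals[, Midrow].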
Answer 1 (score: 4)
I've written it this way to show clearly how the result is worked out. A more compact version follows.
library(data.table)
#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
salaries_low = 100*c(15:24),
salaries_high = 100*c(16:25),
numbers = c(110,180,320,460,850,250,130,70,20,10)
)
#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
# salaries_low salaries_high numbers cumnumbers
# 1: 1500 1600 110 110
# 2: 1600 1700 180 290
# 3: 1700 1800 320 610
# 4: 1800 1900 460 1070
# 5: 1900 2000 850 1920
# 6: 2000 2100 250 2170
# 7: 2100 2200 130 2300
# 8: 2200 2300 70 2370
# 9: 2300 2400 20 2390
# 10: 2400 2500 10 2400
#identifying median group
mediangroup <- salarydata[
(cumnumbers - numbers) <= (max(cumnumbers)/2) &
cumnumbers >= (max(cumnumbers)/2)]
mediangroup
# salaries_low salaries_high numbers cumnumbers
# 1: 1900 2000 850 1920
#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]
#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
# [1] 1915.294
Compact version:
Edit: changed to a function at @AnandaMahto's suggestion. Also, using more generic variable names.
library(data.table)
#Creating function
CalculateMedian <- function(
LowerBound,
UpperBound,
Obs
)
{
#calculating cumulative number of observations and n
dataset <- data.table(UpperBound, LowerBound, Obs)
dataset <- dataset[,cumObs := cumsum(Obs)]
n = dataset[,max(cumObs)]
#identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
median <- dataset[
(cumObs - Obs) <= (max(cumObs)/2) &
cumObs >= (max(cumObs)/2),
LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
]
return(median)
}
# Using function
CalculateMedian(
LowerBound = 100*c(15:24),
UpperBound = 100*c(16:25),
Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294
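One edge case may be worth checking, sketched here in base R rather than data.table so that it runs without extra packages: when n/2 falls exactly on a cumulative frequency, the two-sided filter used above matches two adjacent classes, so the expression returns two values. With contiguous classes both values coincide at the shared boundary. The toy data below is made up for illustration:

```r
# Two contiguous classes with equal counts, so n / 2 hits a class boundary
lower <- c(0, 10)
upper <- c(10, 20)
obs <- c(5, 5)
cum_obs <- cumsum(obs)
n <- sum(obs)

# Same two-sided condition as in the data.table filter
in_median_group <- (cum_obs - obs) <= n / 2 & cum_obs >= n / 2

# Same interpolation formula, applied to every matching class
result <- lower[in_median_group] +
  (upper[in_median_group] - lower[in_median_group]) / obs[in_median_group] *
  (n / 2 - (cum_obs[in_median_group] - obs[in_median_group]))
result

# Taking the first match gives a single value
result[1]
```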
Answer 2 (score: 3)
(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"),
function(x) mean( as.numeric(x) ) ) )
[1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)
Generalizing the median to a weighted median may require looking for a package that provides one.
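For comparison, a weighted median over the class midpoints can be sketched in base R as below. This assumes "weighted median" means the first midpoint whose cumulative weight reaches half of the total weight; it is not Hmisc's implementation. Because it can only return one of the midpoints, it does not interpolate inside the median class the way the grouped-median formula does:

```r
mid <- seq(1550, 2450, by = 100)                       # class midpoints
w <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)  # class frequencies used as weights

cw <- cumsum(w)
# First midpoint whose cumulative weight reaches half of the total weight
weighted_median <- mid[which(cw >= sum(w) / 2)[1]]
weighted_median
```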
Answer 3 (score: 0)
Have you tried median, or apply(yourobject, 2, median) if it is a matrix or data.frame?
Answer 4 (score: 0)
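What about this way? Create a vector for each salary bracket, assuming the values are spread evenly across each band. Then combine those vectors into one big vector and take the median. It is similar to your approach, but the result is slightly different. I'm not a mathematician, so the method could be incorrect.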
dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))
This returns 1915.353.
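The small gap between 1915.353 and the formula's 1915.294 appears to come from where the simulated points sit: each class of k observations is expanded to the points lower + (1:k)/k * width, and the median of 2400 values averages the 1200th and 1201st points, which lands half a step (width / (2 * k)) above the interpolated value. A sketch that checks this, assuming the same expansion as in the code above (variable names are illustrative):

```r
lower <- seq(1500, 2400, by = 100)
upper <- lower + 100
freq <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)

# Expand each class into evenly spaced points ending at its upper boundary
expanded <- unlist(Map(function(l, u, k) l + seq_len(k) / k * (u - l),
                       lower, upper, freq))
simulated <- median(expanded)

# Grouped-median formula for the median class (1900-2000]
interpolated <- 1900 + (1200 - 1070) / 850 * 100

# The difference is half of one step inside the median class
simulated - interpolated
100 / (2 * 850)
```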
Answer 5 (score: -2)
I think this concept should work for you.
$salaries = array(
array("1500","1600"),
array("1600","1700"),
array("1700","1800"),
array("1800","1900"),
array("1900","2000"),
array("2000","2100"),
array("2100","2200"),
array("2200","2300"),
array("2300","2400"),
array("2400","2500"),
);
$numbers = array("110","180","320","460","850","250","130","70","20","10");
$cumsum = array();
$n = 0;
$count = 0;
foreach($numbers as $key=>$number){
$cumsum[$key] = $number;
$n += $number;
if($count > 0){
$cumsum[$key] += $cumsum[$key-1];
}
++$count;
}
$classIndex = 0;
foreach($cumsum as $key=>$cum){
if($cum < ($n/2)){
$classIndex = $key+1;
}
}
$classRange = $salaries[$classIndex];
$L = $classRange[0];
$h = (float) $classRange[1] - $classRange[0];
$f = $numbers[$classIndex];
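// c is the cumulative frequency before the median class, not the previous class's own frequency
$c = $classIndex > 0 ? $cumsum[$classIndex-1] : 0;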
$Median = $L + ($h/$f)*(($n/2)-$c);
echo $Median;