Mid-interval percentiles in data.table, grouped by multiple columns

Asked: 2014-07-29 18:35:01

Tags: r data.table

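I have a dataset for which I want to compute mid-interval percentiles (I got the basic code from jlhoward). I am now trying to compute them per group using by. This works with a single grouping column, but it breaks when I group by two columns. The problem seems to be related to by.
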
rm(list=ls())
library(data.table)
ID<-c(43574,43574,43574,43835,43835,43902,43902,44053,44053,44331,44331,44424,44424,44534,44534,44575,47161,47177,47177,47178,47178,47179,47179,47186,47186,47222,47222,47237,47237,47239,47239,47244,47244,47292,47292,47293,47293,47296,47296,47299,45519,45519,45768,45768,45912,46381,47291,47855,47927,47970,47970,48357,48357,500325,500345,500377,500419,500516,500516,500661,500789,501799,32474,34358,34358,34439,34798,36521,36521,36730,36730,37651,40621,41502,43544,45297,46929)
TOPIC<-c("M","M","R","M","R","R","M","R","M","M","R","M","R","M","R","M","M","M","R","R","M","M","R","M","R","M","R","R","M","M","R","M","R","R","M","M","R","M","R","R","R","M","R","M","R","R","R","M","R","R","M","M","R","R","R","R","R","R","M","R","R","R","R","M","R","R","M","M","R","M","R","R","R","R","R","R","R")
SCORE<-c(189,189,185,184,176,153,172,195,192,198,198,173,166,198,188,198,218,203,213,217,217,227,213,220,210,218,210,204,206,221,209,242,224,209,209,213,216,233,214,217,229,226,196,200,214,226,224,222,226,221,217,214,214,224,219,220,214,222,226,225,243,214,182,162,158,226,170,218,208,191,197,216,220,216,220,206,226)
GROUP<-c(2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,12,12,12,12,12,12,12,12,12,12,12,12,12,12,12)
CalculatedPtile<-c(0.44,0.44,0.50,0.28,0.36,0.07,0.06,0.79,0.61,0.83,0.93,0.17,0.21,0.83,0.64,0.83,0.50,0.04,0.50,0.83,0.38,0.79,0.50,0.63,0.33,0.50,0.33,0.04,0.13,0.71,0.17,0.96,0.96,0.17,0.21,0.29,0.71,0.88,0.63,0.83,0.91,0.83,0.03,0.08,0.19,0.81,0.63,0.58,0.81,0.47,0.42,0.25,0.19,0.63,0.34,0.41,0.19,0.53,0.83,0.72,0.97,0.19,0.14,0.13,0.05,0.91,0.38,0.88,0.41,0.63,0.23,0.55,0.73,0.55,0.73,0.32,0.91)
DT <- data.table(ID,TOPIC,SCORE,GROUP,CalculatedPtile)
#
#insert Splitting variable here and it works
#
ptile.dt <- DT[,sapply(SCORE,function(x)  (sum(SCORE==x)/2+sum(SCORE<x))/.N),by=list(GROUP,TOPIC)]$V1
DT$ptile<- round(ptile.dt,2)
View(DT)

Splitting variable:

#DT <- DT[TOPIC == "M"]

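If I insert the line above (the splitting variable) at the marked place and uncomment it, the result is correct for TOPIC equal to "M". When I leave it commented out, the ptile.dt calculation does not seem to take TOPIC into account as expected.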

Why?

Any help would be greatly appreciated.

1 Answer:

Answer 0: (score: 0)

I think this may be a better way to think about it.

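# Rank of each score within its GROUP/TOPIC cell; ties share the lowest rank
DT[ , rank := rank(SCORE,ties.method="min"), by=list(GROUP,TOPIC)]
# Number of rows sharing the same score within the cell
DT[ , ties := .N, by=list(GROUP,TOPIC,SCORE)]
# Total number of rows in the cell
DT[ , total :=.N, by=list(GROUP,TOPIC)]
# Mid-interval percentile: half of the ties plus everything strictly below
DT[ , ptile := (ties*0.5+(rank-1))/total]
DT[ , ptile := round(ptile,2)]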
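
A likely reason the question's version misbehaves: a grouped query such as DT[, j, by=list(GROUP,TOPIC)] returns its rows group by group, not in the original row order, so pulling out $V1 and assigning it back by position only lines up when each group's rows are already contiguous (as they are when grouping by GROUP alone). Assigning with := inside the grouped call avoids this, because each group's result is written back to the rows it came from. The sketch below illustrates this on a small made-up table; toy, mid_ptile, rnk, ptile_direct and ptile_rank are hypothetical names introduced here, not part of the original post.

```r
library(data.table)

# Made-up example data (not the question's data); TOPIC values are
# interleaved on purpose so the groups are not contiguous.
toy <- data.table(
  GROUP = c(1, 1, 1, 1, 1, 1),
  TOPIC = c("M", "R", "M", "R", "M", "R"),
  SCORE = c(10, 20, 30, 20, 30, 40)
)

# Mid-interval percentile of every value within its own vector
# (same formula as in the question, wrapped in a helper).
mid_ptile <- function(s) {
  sapply(s, function(x) (sum(s == x) / 2 + sum(s < x)) / length(s))
}

# A grouped query: rows come back ordered by group, not by original position.
grouped <- toy[, .(SCORE, p = mid_ptile(SCORE)), by = list(GROUP, TOPIC)]

# Assigning by reference keeps every result on the row it belongs to.
toy[, ptile_direct := mid_ptile(SCORE), by = list(GROUP, TOPIC)]

# Rank-based formulation from the answer, for comparison.
toy[, rnk := rank(SCORE, ties.method = "min"), by = list(GROUP, TOPIC)]
toy[, ties := .N, by = list(GROUP, TOPIC, SCORE)]
toy[, total := .N, by = list(GROUP, TOPIC)]
toy[, ptile_rank := (ties * 0.5 + (rnk - 1)) / total]

print(grouped)
print(toy)
```

On this toy table the two columns ptile_direct and ptile_rank agree, while grouped lists all "M" rows before the "R" rows, which is the reordering that breaks a positional assignment.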