R: Identifying the highest-ranked observation in groups of unequal size

Date: 2014-07-28 19:08:11

Tags: r grouping ranking

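The following code creates two-level groups (State within Test) and then ranks each observation within its group by ascending Grade. School is the tie-breaker.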

School<-rep(c("A","B","C","D"),each=10)
State<-rep(c("NY","NJ"),times=20)
Test<-rep(c("LSAT", "MCAT", "GRE","TOEFL","ACT"), times=8)
Grade<-trunc(rep((seq(from=500, to=600,length.out=4))))
dat<-data.frame(Test,State,School,Grade)
library(plyr)
dat<-ddply(dat, .(Test, State),transform,num=rank(Grade,ties.method="first"))

I use the following code to relabel the first-ranked item in each group as "lowest":

dat$num[dat$num==1]<-"lowest"

In this example df, every group has exactly 4 items, so I can use the following code to relabel the top-ranked item in each group as "highest":

dat$num[dat$num==4]<-"highest"

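But how can I label an observation as "highest" when the number of rows is not constant across groups? The following code creates a version of the df in which one group has two extra rows.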

School<-rep(c("A","B","C","D"),each=10)
State<-rep(c("NY","NJ"),times=20)
Test<-rep(c("LSAT", "MCAT", "GRE","TOEFL","ACT"), times=8)
Grade<-trunc(rep((seq(from=500, to=600,length.out=4))))
dat1<-data.frame(Test,State,School,Grade) 
dat1<-rbind(dat1,
     data.frame(Test="ACT",State="NJ",School="E",Grade=550),
     data.frame(Test="ACT",State="NJ",School="F",Grade=650))
library(plyr)
dat1<-ddply(dat1, .(Test, State),transform,num=rank(Grade,ties.method="first"))

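4 Answers:

Answer 0: (score: 2)

You can do this by checking, within each group, which rank is the highest and which is the lowest, and assigning "highest"/"lowest" to those rows. Here I use ddply to do this, since you are already using plyr in your code.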

dat1 <- ddply(dat1, .(Test, State), transform, num=ifelse(num == max(num), "highest", 
                                                          ifelse(num == min(num), "lowest", num)))

> dat1
    Test State School Grade     num
1    ACT    NJ      A   533  lowest
2    ACT    NJ      B   600       4
3    ACT    NJ      C   533       2
4    ACT    NJ      D   600       5
5    ACT    NJ      E   550       3
6    ACT    NJ      F   650 highest
7    ACT    NY      A   500  lowest
8    ACT    NY      B   566       3
9    ACT    NY      C   500       2
10   ACT    NY      D   566 highest
11   GRE    NJ      A   600       3
12   GRE    NJ      B   533  lowest
13   GRE    NJ      C   600 highest
14   GRE    NJ      D   533       2
15   GRE    NY      A   566       3
16   GRE    NY      B   500  lowest
17   GRE    NY      C   566 highest
18   GRE    NY      D   500       2
19  LSAT    NJ      A   533  lowest
20  LSAT    NJ      B   600       3
21  LSAT    NJ      C   533       2
22  LSAT    NJ      D   600 highest
23  LSAT    NY      A   500  lowest
24  LSAT    NY      B   566       3
25  LSAT    NY      C   500       2
26  LSAT    NY      D   566 highest
27  MCAT    NJ      A   533  lowest
28  MCAT    NJ      B   600       3
29  MCAT    NJ      C   533       2
30  MCAT    NJ      D   600 highest
31  MCAT    NY      A   566       3
32  MCAT    NY      B   500  lowest
33  MCAT    NY      C   566 highest
34  MCAT    NY      D   500       2
35 TOEFL    NJ      A   600       3
36 TOEFL    NJ      B   533  lowest
37 TOEFL    NJ      C   600 highest
38 TOEFL    NJ      D   533       2
39 TOEFL    NY      A   500  lowest
40 TOEFL    NY      B   566       3
41 TOEFL    NY      C   500       2
42 TOEFL    NY      D   566 highest

If your data is large enough, you might also consider using dplyr or data.table, which will be faster than plyr.

Answer 1: (score: 0)

Using dplyr together with cut:

library(dplyr)
 dat1%>% 
 group_by(Test, State) %>%
 mutate(num=rank(Grade, ties.method="first"),
     Categ= cut(num, breaks=c(-Inf, min(num), max(num)-1, Inf), labels=c("lowest", "medium", "highest")))%>%
 arrange(Test,State,num)
 #Source: local data frame [42 x 6]
 #Groups: Test, State

 #    Test State School Grade num   Categ
 #1    ACT    NJ      A   533   1  lowest
 #2    ACT    NJ      C   533   2  medium
 #3    ACT    NJ      E   550   3  medium
 #4    ACT    NJ      B   600   4  medium
 #5    ACT    NJ      D   600   5  medium
 #6    ACT    NJ      F   650   6 highest
 #7    ACT    NY      A   500   1  lowest
 #8    ACT    NY      C   500   2  medium
 #9    ACT    NY      B   566   3  medium
 #10   ACT    NY      D   566   4 highest
 #11   GRE    NJ      B   533   1  lowest
 #12   GRE    NJ      D   533   2  medium
 #13   GRE    NJ      A   600   3  medium
 #14   GRE    NJ      C   600   4 highest
 #15   GRE    NY      B   500   1  lowest
 #16   GRE    NY      D   500   2  medium
 #17   GRE    NY      A   566   3  medium
 #18   GRE    NY      C   566   4 highest
 #19  LSAT    NJ      A   533   1  lowest
 #20  LSAT    NJ      C   533   2  medium
 #21  LSAT    NJ      B   600   3  medium
 #22  LSAT    NJ      D   600   4 highest
 #23  LSAT    NY      A   500   1  lowest
 #24  LSAT    NY      C   500   2  medium
 #25  LSAT    NY      B   566   3  medium
 #26  LSAT    NY      D   566   4 highest
 #27  MCAT    NJ      A   533   1  lowest
 #28  MCAT    NJ      C   533   2  medium
 #29  MCAT    NJ      B   600   3  medium
 #30  MCAT    NJ      D   600   4 highest
 #31  MCAT    NY      B   500   1  lowest
 #32  MCAT    NY      D   500   2  medium
 #33  MCAT    NY      A   566   3  medium
 #34  MCAT    NY      C   566   4 highest
 #35 TOEFL    NJ      B   533   1  lowest
 #36 TOEFL    NJ      D   533   2  medium
 #37 TOEFL    NJ      A   600   3  medium
 #38 TOEFL    NJ      C   600   4 highest
 #39 TOEFL    NY      A   500   1  lowest
 #40 TOEFL    NY      C   500   2  medium
 #41 TOEFL    NY      B   566   3  medium
 #42 TOEFL    NY      D   566   4 highest
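
To make the break points in the cut() call easier to follow, here is a minimal base R sketch on a single hypothetical group of six ranks (the vectors num and two are illustrative, not taken from dat1). It also shows an edge case that seems worth keeping in mind: with only two observations in a group, the two middle breaks coincide and cut() raises an error.

```r
# Ranks for one hypothetical group of six observations
num <- 1:6

# Intervals are right-closed by default:
# (-Inf, 1] -> lowest, (1, 5] -> medium, (5, Inf] -> highest
categ <- cut(num,
             breaks = c(-Inf, min(num), max(num) - 1, Inf),
             labels = c("lowest", "medium", "highest"))
print(data.frame(num, categ))

# Edge case: in a two-row group min(num) and max(num) - 1 are both 1,
# so the breaks are not unique and cut() stops with an error.
two <- 1:2
res <- tryCatch(
  cut(two,
      breaks = c(-Inf, min(two), max(two) - 1, Inf),
      labels = c("lowest", "medium", "highest")),
  error = function(e) conditionMessage(e)  # capture the message instead of failing
)
print(res)
```

If groups of size two can occur in the data, a direct comparison against min(num) and max(num), as in Answer 0, may be the safer choice.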

Answer 2: (score: 0)

Here is a data.table solution:

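library(data.table)
setDT(dat1)
# row numbers of the lowest- and highest-ranked row, alternating min/max, one pair per group
idx = dat1[, .I[c(which.min(num), which.max(num))], by="Test,State"]$V1
# rep() spells out the alternating labels; recent data.table versions refuse to recycle a length-2 vector in :=
dat1[, num := as.character(num)][idx, num := rep(c("lowest", "highest"), length.out = length(idx))]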
#     Test State School Grade     num
# 1:   ACT    NJ      A   533  lowest
# 2:   ACT    NJ      B   600       4
# 3:   ACT    NJ      C   533       2
# 4:   ACT    NJ      D   600       5
# 5:   ACT    NJ      E   550       3
# 6:   ACT    NJ      F   650 highest
# 7:   ACT    NY      A   500  lowest
# 8:   ACT    NY      B   566       3
# ...
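
  1. Convert dat1 to a data.table.
  2. For each group defined by Test,State, get the row numbers in dat1 that correspond to the minimum and the maximum of num, and store them in idx.
  3. First convert num to character type, then subset the rows with idx and replace their num values with "lowest" and "highest", with the two labels recycled along idx.
  4. Note that if a group has only one value, that value is both the minimum and the maximum; in that case this solution gives you "highest" ("lowest" gets overwritten).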
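
The single-row case in step 4 can be illustrated with a small base R analogue. This is only a sketch of the overwrite behaviour using plain vector assignment, not data.table code, and the value 7 is hypothetical.

```r
# In a one-row group the minimum and the maximum are the same row
num <- 7
idx <- c(which.min(num), which.max(num))  # both positions are 1

labels <- as.character(num)
# With a repeated index the later value wins:
# "lowest" is written first and then overwritten by "highest"
labels[idx] <- c("lowest", "highest")
print(labels)
```

If a lone observation should carry a different label, it would need to be handled separately, for example by checking the group size first.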

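Answer 3: (score: 0)

Here is a package-free version that is essentially the same as the original approach, with the extra fix of using ave() instead of the hard-coded 4. It is faster on the short dataset provided, but may not be on larger ones.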

# mark highest first, while num is still numeric
# (once num holds strings, max() would compare them as text)
dat1[dat1$num == ave(x = dat1$num,list(dat1$Test,dat1$State),FUN = max),'num'] <- 'highest'

# mark lowest
dat1[dat1$num == 1,'num'] <- 'lowest'
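
For readers who want to try the ave() idea in isolation, here is a minimal self-contained sketch on a toy data frame with groups of unequal size. The names toy, g, score, grp_max and label are made up for illustration, and the labels go into a separate column so that the numeric ranks stay intact.

```r
# Toy data: group "a" has 3 rows, group "b" has 5 rows (hypothetical values)
toy <- data.frame(g     = c("a", "a", "a", "b", "b", "b", "b", "b"),
                  score = c(10, 30, 20, 5, 50, 40, 20, 30))

# Rank within each group; ties are broken by order of appearance
toy$num <- ave(toy$score, toy$g,
               FUN = function(x) rank(x, ties.method = "first"))

# Per-group maximum rank, returned as a vector aligned row by row with toy
grp_max <- ave(toy$num, toy$g, FUN = max)

# Label the top and bottom rank in each group; keep the rank otherwise
toy$label <- ifelse(toy$num == grp_max, "highest",
                    ifelse(toy$num == 1, "lowest", as.character(toy$num)))
print(toy)
```

Because ave() returns one value per row, the comparison against grp_max works regardless of how many rows each group contains.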