I have a data frame, df:
ID <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df <- data.frame(ID,hr,cr,desc)
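I want to split df into one subset per unique value of df$cr (i.e. group together all rows with cr = 10, all rows with cr = 20, and so on). Then I want to order each subset and keep only the first row for each unique colour description (i.e. if yellow appears four times in the df$desc column of a subset, I want to keep only the row with the lowest df$hr value).
I did this with the following code: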
cr10=subset(df,(df$cr==10))
cr10=cr10[order(cr10$hr) , ]
cr10=subset(cr10,!duplicated(desc))
cr20=subset(df,(df$cr==20))
cr20=cr20[order(cr20$hr) , ]
cr20=subset(cr20,!duplicated(desc))
cr30=subset(df,(df$cr==30))
cr30=cr30[order(cr30$hr) , ]
cr30=subset(cr30,!duplicated(desc))
cr40=subset(df,(df$cr==40))
cr40=cr40[order(cr40$hr) , ]
cr40=subset(cr40,!duplicated(desc))
df_new=rbind(cr10,cr20,cr30,cr40)
> df_new
ID hr cr desc
3 ID3 38 10 green
1 ID1 56 10 yellow
6 ID6 23 20 green
2 ID2 32 20 blue
8 ID8 23 30 yellow
10 ID10 44 30 green
9 ID9 25 40 blue
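However, this is very verbose. Is there a way to shorten the code or use a loop, so that if I had a thousand cr values I would not have to type this out a thousand times?
Answer 0: (score: 4)
You can use dplyr and do this: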
library(dplyr)
df %>% group_by(cr, desc) %>% arrange(hr) %>% slice(1) %>% ungroup()
> df %>% group_by(cr, desc) %>% arrange(hr) %>% slice(1) %>% ungroup()
Source: local data frame [7 x 4]
ID hr cr desc
(fctr) (dbl) (dbl) (fctr)
1 ID3 38 10 green
2 ID1 56 10 yellow
3 ID2 32 20 blue
4 ID6 23 20 green
5 ID10 44 30 green
6 ID8 23 30 yellow
7 ID9 25 40 blue
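As a variation on the pipeline above, here is a minimal sketch that asks for the lowest hr per group directly instead of sorting first. It assumes dplyr 1.0.0 or later, where slice_min() is available; the name res is only illustrative. with_ties = FALSE keeps a single row per group even when two rows share the lowest hr.

```r
library(dplyr)

# Rebuild the example data from the question so the sketch is self-contained
ID   <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr   <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr   <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df   <- data.frame(ID, hr, cr, desc)

# One row per (cr, desc) combination: the row with the smallest hr
res <- df %>%
  group_by(cr, desc) %>%
  slice_min(hr, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

This should select the same seven IDs as the arrange() and slice(1) version, since both keep the lowest-hr row in each (cr, desc) group.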
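Answer 1: (score: 4)
Using data.table, I would use its unique method after quickly sorting the data set. This avoids any by-group operations and uses the fully optimized forder and unique.data.table functions.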
library(data.table)
unique(setDT(df)[order(cr, hr)], by = c("cr", "desc"))
# ID hr cr desc
# 1: ID3 38 10 green
# 2: ID1 56 10 yellow
# 3: ID6 23 20 green
# 4: ID2 32 20 blue
# 5: ID8 23 30 yellow
# 6: ID10 44 30 green
# 7: ID9 25 40 blue
Or the data.table equivalent of the proposed dplyr solution (as noted by @Arun)
setDT(df)[order(hr), .SD[1L], keyby = .(cr, desc)]
Or similarly, using base R, you could do
res <- df[with(df, order(cr, hr)), ]
res[!duplicated(res[c("cr", "desc")]), ]
# ID hr cr desc
# 3 ID3 38 10 green
# 1 ID1 56 10 yellow
# 6 ID6 23 20 green
# 2 ID2 32 20 blue
# 8 ID8 23 30 yellow
# 10 ID10 44 30 green
# 9 ID9 25 40 blue
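Since the question asks specifically about a loop, here is a minimal base R sketch that applies the question's own three steps (subset, order, drop duplicates) to every cr value automatically. The helper name keep_lowest_hr and the result name df_loop are made up for illustration; they are not part of any package.

```r
# Rebuild the example data from the question so the sketch is self-contained
ID   <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr   <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr   <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df   <- data.frame(ID, hr, cr, desc)

# Hypothetical helper: the per-subset steps from the question
keep_lowest_hr <- function(d) {
  d <- d[order(d$hr), ]        # sort by hr; order() puts NA last by default
  d[!duplicated(d$desc), ]     # keep the first (lowest hr) row per colour
}

# split() makes one data frame per cr value, lapply() processes each piece,
# and do.call(rbind, ...) stacks the pieces back together
df_loop <- do.call(rbind, lapply(split(df, df$cr), keep_lowest_hr))
rownames(df_loop) <- NULL      # drop the "10.3"-style row names rbind creates

df_loop
```

On the example data this keeps ID3, ID1, ID6, ID2, ID8, ID10 and ID9, in that order, which matches df_new from the question. For a large number of cr values, the single sort followed by duplicated() shown above is likely to be faster, because it avoids building a separate data frame per group.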
Answer 2: (score: 1)
dplyr is your friend
library(dplyr)
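df %>% group_by(cr, desc) %>% arrange(hr) %>%
  # within each (cr, desc) group, flag every row after the first (lowest hr);
  # duplicated() takes one vector, its second argument is `incomparables`
  mutate(dup = duplicated(desc)) %>% filter(!dup) %>% select(-dup)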