How to include a loop in subsetting my data

Asked: 2016-03-30 11:48:06

Tags: r loops dataframe

I have a data frame, df:

ID <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df <- data.frame(ID,hr,cr,desc)

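I want to separate each unique value of df$cr into a new subset (i.e. group together all rows where cr = 10, or cr = 20, and so on). I then want to order each subset by hr and keep only the first row for each colour description (i.e. if yellow appears four times in the df$desc column of a subset, I want to keep only the row with the lowest df$hr value).

I have done this with the following code: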

cr10=subset(df,(df$cr==10))
cr10=cr10[order(cr10$hr) , ]
cr10=subset(cr10,!duplicated(desc))

cr20=subset(df,(df$cr==20))
cr20=cr20[order(cr20$hr) , ]
cr20=subset(cr20,!duplicated(desc))

cr30=subset(df,(df$cr==30))
cr30=cr30[order(cr30$hr) , ]
cr30=subset(cr30,!duplicated(desc))

cr40=subset(df,(df$cr==40))
cr40=cr40[order(cr40$hr) , ]
cr40=subset(cr40,!duplicated(desc))

df_new=rbind(cr10,cr20,cr30,cr40)
> df_new
     ID hr cr   desc
3   ID3 38 10  green
1   ID1 56 10 yellow
6   ID6 23 20  green
2   ID2 32 20   blue
8   ID8 23 30 yellow
10 ID10 44 30  green
9   ID9 25 40   blue

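However, this is very long-winded. Is there a way to shorten the code or to include a loop, so that if I had a thousand cr values I would not have to type this out over 1000 times?

3 answers:

Answer 0 (score: 4)

You can use dplyr and do this: df %>% group_by(cr, desc) %>% arrange(hr) %>% slice(1) %>% ungroup()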

> df %>% group_by(cr, desc) %>% arrange(hr) %>% slice(1) %>% ungroup()
Source: local data frame [7 x 4]

ID    hr    cr   desc
(fctr) (dbl) (dbl) (fctr)
1    ID3    38    10  green
2    ID1    56    10 yellow
3    ID2    32    20   blue
4    ID6    23    20  green
5   ID10    44    30  green
6    ID8    23    30 yellow
7    ID9    25    40   blue
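A note for readers on newer dplyr releases: there, arrange() ignores grouping by default. The pipeline above still picks the lowest hr per group, because the whole frame is sorted by hr before slice(1) is applied. The sketch below is an alternative that states the intent directly. It assumes dplyr 1.0.0 or later, where slice_min() is available. The name res is only for illustration and is not from the original answer.

```r
library(dplyr)

# Example data from the question
ID <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df <- data.frame(ID, hr, cr, desc)

# For every (cr, desc) combination keep the single row with the smallest hr.
# with_ties = FALSE guarantees exactly one row per group even if hr is tied.
res <- df %>%
  group_by(cr, desc) %>%
  slice_min(hr, n = 1, with_ties = FALSE) %>%
  ungroup()

res
```

The result is ordered by the grouping columns (cr, then desc), so the row order can differ from the rbind() output in the question even though the same seven rows are kept.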

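Answer 1 (score: 4)

With data.table, I would use its unique method after quickly sorting the data set. This avoids any by-group operations and uses the fully optimised forder and unique.data.table functions.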

library(data.table)
unique(setDT(df)[order(cr, hr)], by = c("cr", "desc"))
#      ID hr cr   desc
# 1:  ID3 38 10  green
# 2:  ID1 56 10 yellow
# 3:  ID6 23 20  green
# 4:  ID2 32 20   blue
# 5:  ID8 23 30 yellow
# 6: ID10 44 30  green
# 7:  ID9 25 40   blue

Or a data.table equivalent of the proposed dplyr solution (as mentioned by @Arun):

setDT(df)[order(hr), .SD[1L], keyby = .(cr, desc)]

Or similarly, using base R, you could do

res <- df[with(df, order(cr, hr)), ]
res[!duplicated(res[c("cr", "desc")]), ]
#      ID hr cr   desc
# 3   ID3 38 10  green
# 1   ID1 56 10 yellow
# 6   ID6 23 20  green
# 2   ID2 32 20   blue
# 8   ID8 23 30 yellow
# 10 ID10 44 30  green
# 9   ID9 25 40   blue
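Since the question asks specifically about a loop, here is a minimal base R sketch that keeps the structure of the original per-subset code but lets split() and lapply() do the repetition. The helper name keep_lowest_hr is hypothetical and not part of the original answer.

```r
# Example data from the question
ID <- c('ID1','ID2','ID3','ID4','ID5','ID6','ID7','ID8','ID9','ID10','ID11')
hr <- c(56,32,38,NA,42,23,35,23,25,44,32)
cr <- c(10,20,10,10,10,20,20,30,40,30,40)
desc <- c("yellow","blue","green","yellow","green","green","blue","yellow","blue","green","blue")
df <- data.frame(ID, hr, cr, desc)

# Hypothetical helper: the three steps the question repeats for each cr value.
keep_lowest_hr <- function(sub) {
  sub <- sub[order(sub$hr), ]     # sort by hr; order() places NA last by default
  sub[!duplicated(sub$desc), ]    # keep the first row for each desc value
}

# split() creates one data frame per unique cr value, however many there are;
# lapply() applies the helper to each piece.
pieces <- lapply(split(df, df$cr), keep_lowest_hr)

# Stack the pieces back together, as rbind(cr10, cr20, ...) does in the question.
df_new <- do.call(rbind, pieces)
rownames(df_new) <- NULL          # drop the "10.3"-style row names rbind creates

df_new
```

This should keep the same rows, in the same order, as the df_new shown in the question. The sort-then-duplicated approach above avoids the split entirely and is likely to be faster on large data.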

Answer 2 (score: 1)

dplyr is your friend

library(dplyr)
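df %>% group_by(cr, desc) %>% arrange(hr) %>%
  # The data are already grouped by cr and desc, so duplicated(desc) is enough.
  # The second argument of duplicated() is `incomparables`, not a second key.
  mutate(dup = duplicated(desc)) %>% filter(dup == FALSE) %>% select(-dup)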