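I want to remove duplicates from a data frame, keeping the row with the lowest value in one column, grouped by two columns (Name and cluster). For example, here is my data frame: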
Name cluster score
19 Steve a1 30
51 Steve a2 30
83 Steve a2 -28
93 Steve a2 -38
115 Bob a4 30
147 Bob a5 -8
179 Bob a5 30
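In pandas or SQL this would be done with a groupby, but I am having trouble figuring it out in R, or even getting started. I tried a double grouping on Name and cluster: first group by Name, then by cluster. So, since there are three rows for Steve in a2, I only want to keep the one with the lowest score.
The output I want is as follows: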
Name cluster score
19 Steve a1 30
93 Steve a2 -38
115 Bob a4 30
147 Bob a5 -8
Any help would be greatly appreciated.
Answer 0: (score: 2)
This works:
library(dplyr)
Name=c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob")
cluster=c("a1", "a2", "a2", "a2", "a4", "a5", "a5")
score=c(30,30,-28,-38,30,-8,30)
yourdf<-data.frame(Name,cluster,score)
yourdf %>%
group_by(Name,cluster) %>%
filter(score == min(score))
Name cluster score
1 Steve a1 30
2 Steve a2 -38
3 Bob a4 30
4 Bob a5 -8
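Note that filter(score == min(score)) keeps every row that ties for the group minimum. If exactly one row per group is required, a minimal sketch using slice_min() could look like the following. It assumes dplyr 1.0.0 or later, and the name lowest is only illustrative.

```r
library(dplyr)

# Same sample data as above
yourdf <- data.frame(
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30)
)

# Keep one row per (Name, cluster) group: the one with the lowest score.
# with_ties = FALSE drops extra rows when several share the minimum.
lowest <- yourdf %>%
  group_by(Name, cluster) %>%
  slice_min(score, n = 1, with_ties = FALSE) %>%
  ungroup()

print(lowest)
```

With with_ties = TRUE (the default), slice_min() behaves like the filter() version and returns all tied rows.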
Answer 1: (score: 2)
A simple data.table solution (df here is your original data frame):
library(data.table)
setDT(df)[, list(score = score[which.min(score)]), by = list(Name, cluster)]
# Name cluster score
# 1: Steve a1 30
# 2: Steve a2 -38
# 3: Bob a4 30
# 4: Bob a5 -8
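The call above returns only the grouping columns and score. If the data has further columns that should come along with the minimum row, one option is to subset .SD per group. This is a sketch under assumptions: the id column is added here purely for illustration (it reuses the row labels from the question), and dt and lowest are hypothetical names.

```r
library(data.table)

# Sample data with an extra, illustrative id column
dt <- data.table(
  id      = c(19, 51, 83, 93, 115, 147, 179),
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30)
)

# .SD is the subset of rows for the current group;
# which.min() picks the first row holding the lowest score.
lowest <- dt[, .SD[which.min(score)], by = .(Name, cluster)]

print(lowest)
```

Because which.min() returns a single index, ties are resolved by taking the first matching row in each group.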
Answer 2: (score: 2)
This works with aggregate:
> aggregate(score ~ Name + cluster, mydf, min)
# Name cluster score
# 1 Steve a1 30
# 2 Steve a2 -38
# 3 Bob a4 30
# 4 Bob a5 -8
where mydf is your original data.
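aggregate returns only the grouping columns and the aggregated score, so any other columns are lost. If they are needed, one possible approach is to merge the result back onto the original data. The sketch below assumes an extra id column, added only for illustration; agg and full_rows are hypothetical names.

```r
# Sample data with an extra, illustrative id column
mydf <- data.frame(
  id      = c(19, 51, 83, 93, 115, 147, 179),
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30)
)

# Minimum score per (Name, cluster) group
agg <- aggregate(score ~ Name + cluster, data = mydf, FUN = min)

# merge() joins on the shared columns (Name, cluster, score),
# which recovers the full rows that hold each group minimum.
full_rows <- merge(agg, mydf)

print(full_rows)
```

If several rows in a group share the minimum score, the merge returns all of them.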
Answer 3: (score: 1)
Here is a base R approach:
# Read in sample data
df<-read.table(text="
Name cluster score
19 Steve a1 30
51 Steve a2 30
83 Steve a2 -28
93 Steve a2 -38
115 Bob a4 30
147 Bob a5 -8
179 Bob a5 30", header=TRUE)
# order it
df_sorted <- df[with(df, order(Name, cluster, score)),]
# get rid of duplicated names and clusters, keeping the first,
# which will be the minimum score due to the sorting.
df_sorted[!duplicated(df_sorted[,c('Name','cluster')]), ]
# Name cluster score
#115 Bob a4 30
#147 Bob a5 -8
#19 Steve a1 30
#93 Steve a2 -38
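The result above comes out in sorted order (Bob before Steve) rather than in the order of the original data. If the original order matters, a small sketch is to re-sort by the original row names afterwards. This assumes the row names are numeric labels, as they are in the sample data; lowest is an illustrative name.

```r
# Read in sample data; the first column becomes the row names
df <- read.table(text = "
Name cluster score
19 Steve a1 30
51 Steve a2 30
83 Steve a2 -28
93 Steve a2 -38
115 Bob a4 30
147 Bob a5 -8
179 Bob a5 30", header = TRUE)

# Sort so the lowest score comes first within each (Name, cluster) group
df_sorted <- df[order(df$Name, df$cluster, df$score), ]

# Keep the first row of each group, i.e. the minimum score
lowest <- df_sorted[!duplicated(df_sorted[, c("Name", "cluster")]), ]

# Restore the original row order using the numeric row names
lowest <- lowest[order(as.numeric(rownames(lowest))), ]

print(lowest)
```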