Group by and remove duplicates, keeping the lowest value, in an R data frame

Asked: 2014-08-31 03:01:45

Tags: r group-by duplicates

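I basically want to remove duplicates from a data frame, keeping the lowest value in one column, grouped by two columns (Name and cluster). For example, here is my data frame: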

       Name   cluster   score
19     Steve   a1       30
51     Steve   a2       30
83     Steve   a2      -28
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8
179    Bob     a5       30
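In pandas and SQL this would be done with a groupby, but I'm struggling to figure it out in R, or even to get started. I tried a double groupby on Name and cluster: first group by Name, then by cluster. So, since there are three rows for Steve in a2, I only want to keep the one with the lowest score.

My desired output is as follows: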

       Name   cluster   score
19     Steve   a1       30
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8

Any help would be much appreciated.

4 answers:

Answer 0: (score: 2)

This works:

library(dplyr)


Name=c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob")
cluster=c("a1", "a2", "a2", "a2", "a4", "a5", "a5")
score=c(30,30,-28,-38,30,-8,30)
yourdf<-data.frame(Name,cluster,score)

yourdf %>%
  group_by(Name,cluster) %>%
  filter(score == min(score))

   Name cluster score
1 Steve      a1    30
2 Steve      a2   -38
3   Bob      a4    30
4   Bob      a5    -8
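
Note that `filter(score == min(score))` keeps every row that ties for the minimum within a group. If exactly one row per group is wanted, one possible variation (a sketch, not part of the original answer) is `slice(which.min(score))`, which keeps only the first minimum. The `tied` data frame below is made-up illustration data, not the data from the question.

```r
library(dplyr)

# Hypothetical data with a tie for the minimum in the (Steve, a2) group
tied <- data.frame(
  Name    = c("Steve", "Steve", "Steve", "Bob"),
  cluster = c("a1", "a2", "a2", "a4"),
  score   = c(30, -38, -38, 30)
)

# filter() keeps both tied rows of the (Steve, a2) group
keeps_ties <- tied %>%
  group_by(Name, cluster) %>%
  filter(score == min(score)) %>%
  ungroup()

# slice(which.min()) keeps only the first minimum row per group
one_per_group <- tied %>%
  group_by(Name, cluster) %>%
  slice(which.min(score)) %>%
  ungroup()
```

Which behaviour is preferable depends on whether tied rows count as duplicates to be dropped.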

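Answer 1: (score: 2)

A simple data.table solution:

library(data.table)
# df is the data frame from the question; setDT() converts it to a data.table by reference
setDT(df)[, list(score = score[which.min(score)]), by = list(Name, cluster)]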
#     Name cluster score
# 1: Steve      a1    30
# 2: Steve      a2   -38
# 3:   Bob      a4    30
# 4:   Bob      a5    -8
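
The call above returns only the grouping columns and `score`. If the data had further columns that should travel with the minimum row, one way to sketch that is to subset `.SD` per group. The extra column `note` below is hypothetical and only there to show that whole rows are kept.

```r
library(data.table)

# Sample data from the question, plus a hypothetical extra column `note`
dt <- data.table(
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30),
  note    = letters[1:7]
)

# .SD holds the rows of the current group; keep its first minimum-score row
lowest <- dt[, .SD[which.min(score)], by = .(Name, cluster)]
```

Because `which.min()` returns a single index, ties within a group are resolved in favour of the first row.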

Answer 2: (score: 2)

This works with aggregate:

aggregate(score ~ Name + cluster, mydf, min)
#    Name cluster score
# 1 Steve      a1    30
# 2 Steve      a2   -38
# 3   Bob      a4    30
# 4   Bob      a5    -8

where mydf is your original data.
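
`aggregate` builds a new data frame, so the original row names (19, 93, ...) and any other columns are not carried over. If those matter, a possible base R alternative (a sketch, not part of the original answer) is to compute the per-group minimum with `ave()` and subset the original rows. The names `group_min` and `lowest` are just illustrative.

```r
# Sample data from the question; the first column becomes the row names
mydf <- read.table(text = "
       Name   cluster   score
19     Steve   a1       30
51     Steve   a2       30
83     Steve   a2      -28
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8
179    Bob     a5       30", header = TRUE)

# One grouping factor per (Name, cluster) pair; drop = TRUE skips unused combinations
grp <- interaction(mydf$Name, mydf$cluster, drop = TRUE)

# ave() returns, for every row, the minimum score within that row's group
group_min <- ave(mydf$score, grp, FUN = min)

# Keep the rows whose score equals their group minimum (ties are all kept)
lowest <- mydf[mydf$score == group_min, ]
```

This keeps the rows in their original order with their original row names.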

Answer 3: (score: 1)

Here is a base R approach:

# Read in sample data
df<-read.table(text="
       Name   cluster   score
19     Steve   a1       30
51     Steve   a2       30
83     Steve   a2      -28
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8
179    Bob     a5       30", header=TRUE)

# order it
df_sorted <- df[with(df, order(Name, cluster, score)),]

# get rid of duplicated names and clusters, keeping the first,
# which will be the minimum score due to the sorting.

df_sorted[!duplicated(df_sorted[,c('Name','cluster')]), ]
#     Name cluster score
#115   Bob      a4    30
#147   Bob      a5    -8
#19  Steve      a1    30
#93  Steve      a2   -38