Question

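I basically want to remove duplicates from a data frame and keep the lowest value in a column, grouped by two columns (Name and cluster). For example, here is my data frame: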

       Name   cluster   score
19     Steve   a1       30
51     Steve   a2       30
83     Steve   a2      -28
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8
179    Bob     a5       30

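In pandas and SQL this would be done with a groupby, but I am struggling to work it out in R, or even to get started. I tried a double groupby on Name and cluster: first group by Name, then by cluster. So since there are three Steve a2's, I only want to keep the one with the lowest score.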

My desired output is as follows:

       Name   cluster   score
19     Steve   a1       30
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8

Any help would be greatly appreciated.

Answer 1

This works:

library(dplyr)


Name=c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob")
cluster=c("a1", "a2", "a2", "a2", "a4", "a5", "a5")
score=c(30,30,-28,-38,30,-8,30)
yourdf<-data.frame(Name,cluster,score)

yourdf %>%
  group_by(Name,cluster) %>%
  filter(score == min(score))

   Name cluster score
1 Steve      a1    30
2 Steve      a2   -38
3   Bob      a4    30
4   Bob      a5    -8
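
Note that filter(score == min(score)) keeps every row that ties for the minimum within a group. If exactly one row per group is needed, a minimal sketch using slice_min (assuming dplyr 1.0.0 or later) could look like the following. The extra tied row is hypothetical and is added only to show the difference.

```r
library(dplyr)

yourdf <- data.frame(
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30)
)

# Hypothetical extra row (not in the question) that ties for the
# minimum score in the Steve/a1 group.
tied <- rbind(yourdf, data.frame(Name = "Steve", cluster = "a1", score = 30))

# filter() keeps both tied rows in Steve/a1.
with_filter <- tied %>%
  group_by(Name, cluster) %>%
  filter(score == min(score)) %>%
  ungroup()

# slice_min() with with_ties = FALSE keeps one row per group.
with_slice <- tied %>%
  group_by(Name, cluster) %>%
  slice_min(score, n = 1, with_ties = FALSE) %>%
  ungroup()
```

With the tied row present, with_filter has one more row than with_slice, which has exactly one row for each Name/cluster pair.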

Answer 2

A simple data.table solution:

library(data.table)
setDT(df)[, list(score = score[which.min(score)]), by = list(Name, cluster)]
#     Name cluster score
# 1: Steve      a1    30
# 2: Steve      a2   -38
# 3:   Bob      a4    30
# 4:   Bob      a5    -8
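
The call above returns only the grouping columns and score. If the data has further columns that should travel with the minimum row, one possible variant is to subset .SD instead. This is a sketch only; the id column is a hypothetical extra column that reuses the row labels from the question.

```r
library(data.table)

dt <- data.table(
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30),
  id      = c(19, 51, 83, 93, 115, 147, 179)  # hypothetical extra column
)

# For each Name/cluster group, keep the whole row with the lowest score.
# which.min() returns the first minimum, so ties yield a single row.
result <- dt[, .SD[which.min(score)], by = .(Name, cluster)]
```

Here result keeps the id of the row that held the minimum score in each group, not just the score itself.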

Answer 3

This works with aggregate.

aggregate(score ~ Name + cluster, mydf, min)
#    Name cluster score
# 1 Steve      a1    30
# 2 Steve      a2   -38
# 3   Bob      a4    30
# 4   Bob      a5    -8

where mydf is your original data.
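
aggregate builds a new data frame, so the original row names (19, 93, ...) shown in the desired output are lost. If those matter, a base R sketch with ave, which computes the per-group minimum aligned to the original rows, might look like this. The names grp and keep are illustrative only.

```r
mydf <- data.frame(
  Name    = c("Steve", "Steve", "Steve", "Steve", "Bob", "Bob", "Bob"),
  cluster = c("a1", "a2", "a2", "a2", "a4", "a5", "a5"),
  score   = c(30, 30, -28, -38, 30, -8, 30),
  row.names = c("19", "51", "83", "93", "115", "147", "179")
)

# One grouping factor per Name/cluster pair; drop = TRUE discards
# combinations that never occur in the data.
grp <- interaction(mydf$Name, mydf$cluster, drop = TRUE)

# ave() returns the group minimum repeated for every row of the group,
# so comparing it with score flags the minimum rows.
keep <- mydf$score == ave(mydf$score, grp, FUN = min)

result <- mydf[keep, ]
```

Like the filter approach, this keeps all rows that tie for the minimum in a group.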

Answer 4

Here is a base R approach:

# Read in sample data
df<-read.table(text="
       Name   cluster   score
19     Steve   a1       30
51     Steve   a2       30
83     Steve   a2      -28
93     Steve   a2      -38
115    Bob     a4       30
147    Bob     a5       -8
179    Bob     a5       30", header=TRUE)

# order it
df_sorted <- df[with(df, order(Name, cluster, score)),]

# get rid of duplicated names and clusters, keeping the first,
# which will be the minimum score due to the sorting.

df_sorted[!duplicated(df_sorted[,c('Name','cluster')]), ]
#     Name cluster score
#115   Bob      a4    30
#147   Bob      a5    -8
#19  Steve      a1    30
#93  Steve      a2   -38

groupby and remove lowest value in R data frame
