Question

I have a data frame:

df <- data.frame(doc_id = c(1, 1, 2, 2), terms = c("virginia", "bye", "energy", "energy"), freq = c(1, 1, 2, 1))

that is:

> df
  doc_id    terms freq
1      1 virginia    1
2      1      bye    1
3      2   energy    2
4      2   energy    1

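I want to remove rows that are duplicated in the columns doc_id and terms; for example, rows 3 and 4 have the same doc_id and terms. Among the duplicates, however, the row I want to keep is the one with the largest value in freq.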

Answer 1

Here is an option with slice. After grouping by 'doc_id' and 'terms', slice the row in each group where 'freq' has its max value:

library(dplyr)
df %>%
  group_by(doc_id, terms) %>%
  slice(which.max(freq))
# A tibble: 3 x 3
# Groups:   doc_id, terms [3]
#   doc_id terms     freq
#    <dbl> <fct>    <dbl>
# 1      1 bye          1
# 2      1 virginia     1
# 3      2 energy       2

Or, if there are only three columns, simply use summarise:

df %>%
  group_by(doc_id, terms) %>%
  summarise(freq = max(freq))

Or use arrange and distinct:

df %>%
  arrange(doc_id, terms, desc(freq)) %>%
  distinct(doc_id, terms, .keep_all = TRUE)

Or, in base R, first order the dataset so that the max value of 'freq' becomes the first row of each group, then use duplicated to remove the duplicated rows.
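The base R approach described above (order first, then duplicated) can be sketched as follows. This is a minimal illustration, not necessarily the answerer's original code; the name df1 is hypothetical.

```r
df <- data.frame(doc_id = c(1, 1, 2, 2),
                 terms = c("virginia", "bye", "energy", "energy"),
                 freq = c(1, 1, 2, 1))

# Sort so that, within each (doc_id, terms) group, the largest freq comes first.
# df1 is a hypothetical name used only for this sketch.
df1 <- df[order(df$doc_id, df$terms, -df$freq), ]

# duplicated() flags every row after the first one in each group; drop those.
df1 <- df1[!duplicated(df1[c("doc_id", "terms")]), ]
df1
```

Because the rows are sorted before deduplication, the result comes back in sorted order rather than in the original row order.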

Answer 2

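Another base R option, using subset + ave:

# Within each (doc_id, terms) group, flag only the position of the max freq;
# ave() returns the flags as 0/1 numbers, and !! turns them back into logicals.
dfout <- subset(df,
                !!ave(freq,
                      doc_id,
                      terms,
                      FUN = function(x) seq_along(x) == which.max(x)))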

which gives

> dfout
  doc_id    terms freq
1      1 virginia    1
2      1      bye    1
3      2   energy    2
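To see why the `!!` is needed, it may help to look at what ave returns on its own. The sketch below rebuilds the example data and stores the intermediate result in a hypothetical variable named keep.

```r
df <- data.frame(doc_id = c(1, 1, 2, 2),
                 terms = c("virginia", "bye", "energy", "energy"),
                 freq = c(1, 1, 2, 1))

# ave() writes the per-group results back into a copy of the numeric freq
# vector, so the TRUE/FALSE flags are stored as 1/0.
# keep is a hypothetical name used only for this sketch.
keep <- with(df, ave(freq, doc_id, terms,
                     FUN = function(x) seq_along(x) == which.max(x)))
keep

# Double negation converts the 1/0 flags back to a logical filter.
df[!!keep, ]
```

Since which.max returns only the first maximum, exactly one row per group is kept even when several rows tie for the largest freq.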

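Or a more compact version using aggregate (thanks to @akrun). In the formula, `.` stands for all remaining columns, so this works here because doc_id and terms are the only columns besides freq:

dfout <- aggregate(freq ~ ., df, FUN = max)

which gives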

> dfout
  doc_id    terms freq
1      1      bye    1
2      2   energy    2
3      1 virginia    1

Remove duplicate rows in a data frame based on columns
