I have a data frame such as:
Groups Name names2 Category value
G1 A habit1 cat1 20
G1 A habit2 NA 1
G1 B habit3 NA 100
G1 B habit4 cat3 23
G2 A habit5 cat4 32
G2 C habit6 NA 100
G2 C habit7 cat2 21
G2 D habit8 cat3 34
G2 D habit9 cat5 43
and I would like to keep only one row for each combination of Groups and Name, to get:
Groups Name names2 Category value
G1 A habit1 cat1 20
G1 B habit4 cat3 23
G2 A habit5 cat4 32
G2 C habit7 cat2 21
G2 D habit9 cat5 43
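Within each Groups and Name combination, the winning row is the one that has information (not NA) in Category. If several rows have information, the one with the highest value wins (for G2/D, habit9 wins with 43 because 43 > 34).
If there are only NAs, keep the row with the highest value anyway.
Thanks for your help.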
Answer 0: (score: 3)
What you need is group_by and filter, followed by top_n:
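library(dplyr)
my.df %>%
# one group per Groups/Name combination
group_by(Groups, Name) %>%
# drop rows where Category is missing
filter(!is.na(Category)) %>%
# keep the row(s) with the largest value in each group
top_n(1, value)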
# A tibble: 5 x 5
# Groups: Groups, Name [5]
# Groups Name names2 Category value
# <chr> <chr> <chr> <chr> <int>
# 1 G1 A habit1 cat1 20
# 2 G1 B habit4 cat3 23
# 3 G2 A habit5 cat4 32
# 4 G2 C habit7 cat2 21
# 5 G2 D habit9 cat5 43
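However, this drops any Groups/Name combination in which Category is missing for every row, and if several rows tie for the maximum value, all of them are kept.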
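If both limitations matter, one possible workaround is to sort instead of filter: put rows with a non-missing Category first, then sort by descending value, and take the first row of each group. This is only a sketch, not part of the answer above. demo.df and its G3 rows are hypothetical data made up to show the all-NA case and a tie.

```r
library(dplyr)

# Hypothetical data: G3/E has Category missing in every row,
# and G3/F has two rows tied on the maximum value.
demo.df <- data.frame(
  Groups   = c("G2", "G2", "G3", "G3", "G3", "G3"),
  Name     = c("D", "D", "E", "E", "F", "F"),
  names2   = c("habit8", "habit9", "habit10", "habit11", "habit12", "habit13"),
  Category = c("cat3", "cat5", NA, NA, "cat1", "cat2"),
  value    = c(34L, 43L, 5L, 50L, 7L, 7L),
  stringsAsFactors = FALSE
)

result <- demo.df %>%
  group_by(Groups, Name) %>%
  # Within each group: non-missing Category first, then highest value first.
  arrange(is.na(Category), desc(value), .by_group = TRUE) %>%
  # Keep exactly one row per group, even when values are tied.
  slice(1) %>%
  ungroup()

print(result)
```

Here G2/D keeps habit9, G3/E keeps habit11 even though its Category is NA, and G3/F keeps a single row despite the tie. Note also that top_n has been superseded by slice_max since dplyr 1.0.0; slice_max(value, n = 1, with_ties = FALSE) after the filter step would resolve the tie issue but not the all-NA one.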
Data
my.df <- structure(list(Groups = c("G1", "G1", "G1", "G1", "G2", "G2", "G2", "G2", "G2"),
Name = c("A", "A", "B", "B", "A", "C", "C", "D", "D"),
names2 = c("habit1", "habit2", "habit3", "habit4", "habit5", "habit6", "habit7", "habit8", "habit9"),
Category = c("cat1", NA, NA, "cat3", "cat4", NA, "cat2", "cat3", "cat5"),
value = c(20L, 1L, 100L, 23L, 32L, 100L, 21L, 34L, 43L)),
class = "data.frame", row.names = c(NA, -9L))