Question

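I have a data frame that contains nearly identical rows. I need code that, from each pair (or group of 3 or 4) of rows with matching Name, Surname, V1 and P1 but different V2 and P2, selects one row and deletes the others. Which row to keep is determined by the following condition: if P1 = P2, that row must be kept; if P1 ≠ P2, the row with the largest P2 should be kept.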

id   Name  Surname     V1         P1       V2        P2
15  John    Smith     0.80        4       0.75        2    
16  John    Smith     0.80        4       1.00        3    
17  John    Smith     0.80        5       0.75        2    
18  John    Smith     0.80        5       1.00        3    
19  John    Smith     0.75        2       0.75        2    
20  John    Smith     0.75        2       1.00        3

My expected output looks like this:

id  Name  Surname     V1          P1       V2        P2
16  John    Smith     0.80        4       1.00        3    
18  John    Smith     0.80        5       1.00        3    
19  John    Smith     0.75        2       0.75        2

Is there a simple way to do this?

Extended dataset

id    Name Surname V1     P1    V2      P2
194   Lisa  Paul   0,1    1     0,2      1
195   Lisa  Paul   0,1    1     0,4      5
196   Lisa  Paul   0,1    4     0,5      1
197   Lisa  Paul   0,1    4     0,1      5
198   Lisa  Paul   0,1    2     0,1      1
199   Lisa  Paul   0,1    2     0,4      5
201   Lisa  Paul   0,1    3     0,2      1
202   Lisa  Paul   0,1    3     0,1      5
203   Lisa  Paul   0,1    5     0,3      1
204   Lisa  Paul   0,1    5     0,2      5
205   Lisa  Paul   0,1    6     0,2      1
206   Lisa  Paul   0,1    6     0,1      5

Answer 1

At least on your sample data, the following seems to work:

library(tidyverse)
data %>% group_by(Name, Surname, P1) %>%
  filter(P2 == max(P2[P1 >= P2]))
# A tibble: 6 x 7
# Groups:   Name, Surname, P1 [6]
#      id Name  Surname    V1    P1    V2    P2
#   <int> <fct> <fct>   <dbl> <int> <dbl> <int>
# 1   194 Lisa  Paul      0.1     1   0.2     1
# 2   196 Lisa  Paul      0.1     4   0.5     1
# 3   198 Lisa  Paul      0.1     2   0.1     1
# 4   201 Lisa  Paul      0.1     3   0.2     1
# 5   204 Lisa  Paul      0.1     5   0.2     5
# 6   206 Lisa  Paul      0.1     6   0.1     5

For each unique group of Name, Surname and P1, I keep the row with the largest P2 among the P2 values that do not exceed the corresponding P1.
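
To make the approach above easy to try, here is a minimal self-contained sketch that rebuilds the first sample table from the question and applies the same grouped filter. The `data.frame()` construction and the `result` name are my own assumptions for illustration. The grouping follows the answer (Name, Surname, P1) and does not include V1, which the question also mentions.

```r
library(dplyr)

# Sample data copied from the first table in the question
data <- data.frame(
  id      = 15:20,
  Name    = "John",
  Surname = "Smith",
  V1      = c(0.80, 0.80, 0.80, 0.80, 0.75, 0.75),
  P1      = c(4, 4, 5, 5, 2, 2),
  V2      = c(0.75, 1.00, 0.75, 1.00, 0.75, 1.00),
  P2      = c(2, 3, 2, 3, 2, 3)
)

result <- data %>%
  group_by(Name, Surname, P1) %>%
  # Within each group, keep the row whose P2 is the largest value
  # that does not exceed P1 (this also keeps a row with P1 == P2)
  filter(P2 == max(P2[P2 <= P1])) %>%
  ungroup()

result$id
```

On this sample the remaining ids should be 16, 18 and 19, matching the expected output in the question. Note that if a group has no row with P2 <= P1, `max()` is called on an empty vector and returns `-Inf` with a warning, so that group would be dropped entirely. Whether that is acceptable depends on your data.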
