Is there a way to check whether duplicated rows occur the same number of times in two data frames?

Time: 2019-09-25 02:05:50

Tags: r dataframe

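df1 is a subset of df2. I want to check whether each combination of ids is repeated the same number of times in df1 as in df2. From the larger data frame, df2, I want to create 2 new data frames: one keeping the rows whose number of duplicates is the same in both data frames, and the other keeping all remaining rows.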

Example (df1 first, then df2):

              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                1        3          2
                2        1          1
                2        1          1
                2        2          2
                2        3          4


              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                1        3          2
                1        3          2
                2        1          1
                2        1          1
                2        2          2
                2        2          2
                2        3          4
                2        3          4
                2        4          1
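
To make the example reproducible, the two tables above can be typed in as data frames. This is a minimal sketch that assumes the first table is df1 and the second is df2 (the tables themselves are not labelled), and that the ids are plain numeric columns:

```r
# Assumption: first table = df1 (the subset), second table = df2 (the larger one)
df1 <- data.frame(
  SAMPN = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  PERNO = c(1, 1, 1, 2, 3, 1, 1, 2, 3),
  loop  = c(1, 1, 2, 2, 2, 1, 1, 2, 4)
)

df2 <- data.frame(
  SAMPN = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  PERNO = c(1, 1, 1, 2, 3, 3, 1, 1, 2, 2, 3, 3, 4),
  loop  = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 4, 4, 1)
)

# Sanity check: row counts should match the tables (9 and 13 rows)
nrow(df1)
nrow(df2)
```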

Expected output

Rows from df2 that have the same number of duplicates in both data frames:

              SAMPN    PERNO       loop
                1        1          1
                1        1          1
                1        1          2
                1        2          2
                2        1          1
                2        1          1

Rows from df2 whose number of duplicates differs between the two data frames:

              SAMPN    PERNO       loop

                1        3          2
                1        3          2
                2        2          2
                2        2          2
                2        3          4
                2        3          4
                2        4          1

Data for testing:

structure(list(SAMPN = c(50, 50, 50, 50, 50, 50, 51, 53, 53, 
53, 53, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54, 54), PERNO = c(4, 
4, 5, 5, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 4, 5, 5, 5
), PLANO = c(4, 5, 2, 3, 2, 3, 3, 2, 3, 4, 5, 2, 3, 4, 5, 6, 
7, 2, 3, 2, 3, 4), loop = c(3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 3, 3, 3, 2, 2, 2, 2, 3), TPURP = structure(c(16L, 2L, 
5L, 2L, 5L, 2L, 2L, 18L, 18L, 13L, 2L, 8L, 3L, 2L, 20L, 13L, 
2L, 5L, 2L, 5L, 2L, 3L), .Label = c("(1) Working at home (for pay)", 
"(2) All other home activities", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger", "(10) Other, specify - transportation", 
"(11) Work/Business related", "(12) Service Private Vehicle", 
"(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15) Household errands", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(20) Recreation/Entertainment", 
"(21) Visit friends/relative", "(24) Loop trip", "(97) Other, specify"
), class = "factor")), row.names = 431:452, class = "data.frame")


structure(list(SAMPN = c(48, 50, 50, 50, 50, 50, 56, 56, 58, 
58, 58, 58, 58, 58, 58, 58), PERNO = c(7, 1, 1, 2, 3, 6, 1, 3, 
1, 1, 1, 1, 2, 2, 2, 2), PLANO = c(3, 2, 4, 2, 4, 2, 6, 3, 2, 
3, 4, 5, 2, 3, 4, 5), loop = c(2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 
2, 2, 2, 2, 2, 2), TPURP = structure(c(2L, 8L, 22L, 8L, 22L, 
5L, 2L, 2L, 18L, 17L, 13L, 2L, 16L, 17L, 13L, 2L), .Label = c("(1) Working at home (for pay)", 
"(2) All other home activities", "(3) Work/Job", "(4) All other activities at work", 
"(5) Attending class", "(6) All other activities at school", 
"(7) Change type of transportation/transfer", "(8) Dropped off passenger", 
"(9) Picked up passenger", "(10) Other, specify - transportation", 
"(11) Work/Business related", "(12) Service Private Vehicle", 
"(13) Routine Shopping", "(14) Shopping for major purchases", 
"(15) Household errands", "(16) Personal Business", "(17) Eat meal outside of home", 
"(18) Health care", "(19) Civic/Religious activities", "(20) Recreation/Entertainment", 
"(21) Visit friends/relative", "(24) Loop trip", "(97) Other, specify"
), class = "factor")), row.names = c(412L, 420L, 422L, 423L, 
428L, 435L, 467L, 474L, 480L, 481L, 482L, 483L, 484L, 485L, 486L, 
487L), class = "data.frame")

1 answer:

Answer 0 (score: 1)

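There may be a simpler way, but here is one approach using dplyr. We first count the number of rows in each SAMPN/PERNO/loop group in both data frames, then combine the counts with left_join. In the result, n.x is the count from df2 and n.y is the count from df1 (NA when the group does not occur in df1).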

library(dplyr)

df3 <- left_join(df2 %>% count(SAMPN, PERNO, loop), 
                 df1 %>% count(SAMPN, PERNO, loop), by = c("SAMPN", "PERNO","loop"))

We then select the rows of df2 whose counts match, i.e. the groups in df3 where n.x equals n.y:

df3 %>%
  filter(n.x == n.y) %>%
  select(names(df2)) %>%
  inner_join(df2)

#  SAMPN PERNO  loop
#  <int> <int> <int>
#1     1     1     1
#2     1     1     1
#3     1     1     2
#4     1     2     2
#5     2     1     1
#6     2     1     1

And the rows whose counts do not match (including groups that are missing from df1):

df3 %>%
  filter(n.x != n.y | is.na(n.y)) %>%
  select(names(df2)) %>%
  inner_join(df2)

#  SAMPN PERNO  loop
#  <int> <int> <int>
#1     1     3     2
#2     1     3     2
#3     2     2     2
#4     2     2     2
#5     2     3     4
#6     2     3     4
#7     2     4     1
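
The code above works when df2 contains only the three key columns. The posted test data also carry columns such as PLANO and TPURP, which df3 does not have, so `select(names(df2))` would not find them. One possible variant, sketched here under assumptions, attaches the counts to df2 directly so every column is kept. The `note` column and the helper names `counts_df1`, `tagged`, `same_rows` and `diff_rows` are hypothetical, and the `name` argument of `count()` and `add_count()` assumes dplyr 0.8.0 or later.

```r
library(dplyr)

# Toy data from the question
df1 <- data.frame(
  SAMPN = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  PERNO = c(1, 1, 1, 2, 3, 1, 1, 2, 3),
  loop  = c(1, 1, 2, 2, 2, 1, 1, 2, 4)
)
df2 <- data.frame(
  SAMPN = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2),
  PERNO = c(1, 1, 1, 2, 3, 3, 1, 1, 2, 2, 3, 3, 4),
  loop  = c(1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 4, 4, 1)
)
# Hypothetical non-key column, standing in for PLANO / TPURP
df2$note <- letters[seq_len(nrow(df2))]

keys <- c("SAMPN", "PERNO", "loop")

# Number of rows per key combination in df1
counts_df1 <- count(df1, SAMPN, PERNO, loop, name = "n_df1")

# Add df2's own group size to every row, then look up df1's count
tagged <- df2 %>%
  add_count(SAMPN, PERNO, loop, name = "n_df2") %>%
  left_join(counts_df1, by = keys) %>%
  mutate(same = !is.na(n_df1) & n_df1 == n_df2)

# Split df2 into the two requested data frames, dropping helper columns
same_rows <- tagged %>% filter(same)  %>% select(-n_df1, -n_df2, -same)
diff_rows <- tagged %>% filter(!same) %>% select(-n_df1, -n_df2, -same)
```

Because the counts are joined onto df2 rather than df2 onto the counts, the original row order and all non-key columns of df2 are preserved in both results.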