Question

Solution (posted by Jon Spring, in the comments on the answer provided):

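#Applied to fruits example
#Join on id and year only: joining on name or score.x as well would leave
#no rows to match and no score.x.x / score.x.y columns to compare
df2 %>%
    select(id, name, score.x, year) %>%
    left_join(df1 %>% select(id, score.x, year),
              by = c("id", "year")) %>%
    mutate(match = score.x.x == score.x.y)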

#Applied to df being worked with
#(the column is spelled Country in the Votesfull printout below)
Votesfull %>%
    select(rcid, session.x, Country, unres, vote) %>%
    left_join(OTHER_DATA %>% select(rcid, session.x, Country, unres, vote),
              by = c("rcid", "session.x", "Country", "unres")) %>%
    mutate(match = vote.x == vote.y)

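I am comparing data frames that differ in length but are similar in structure. Is there a way to compare the longer df with the shorter df piece by piece?

For clarity, I will call the shorter df df1 and the longer one df2. df1 is a subset of a longer df, while df2 is a collection of similar tables stacked into one large df. Each subsection of df2 is about 6,000 observations long, and df1 is about the same length.

I would like help with whether I can pull out one such section of df2, compare it with df1, and keep iterating until I reach the end of df2.

I have searched for and tried solutions for data frames of equal or similar size, but I could not find one for data frames whose heights differ greatly. In the data frames I am working with, the larger one is roughly 150 times longer than the shorter one, but the total number of observations differs slightly, so the lengths of the two dfs are not exact multiples of each other.

The data structure itself may be the problem. If so, I apologize for my lack of forethought and skill.

The example df1 and df2 below illustrate the problem: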

df1 <- data.frame(
    "id" = 1:3,
    "name" = c('apple', 'apple', 'apple'),
    "score.x" = c(1, 3, 2),
    "year" = c(2000, 2001, 2002)
)

df2 <- data.frame(
    "id" = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
    "name" = c('orange', 'orange', 'orange', 'melon', 'melon', 'melon', 'grapes', 'grapes', 'grapes', 'lemon', 'lemon', 'lemon'),
    "score.x" = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
    "year" = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)

df1
    id  name    score.x  year
1   1   apple   1        2000
2   2   apple   3        2001
3   3   apple   2        2002

df2
    id  name    score.x  year
1   1   orange  2        2000
2   2   orange  3        2001
3   3   orange  1        2002
4   1   melon   1        2000
5   2   melon   1        2001
6   3   melon   2        2002
7   1   grapes  3        2000
8   2   grapes  3        2001
9   3   grapes  2        2002
10  1   lemon   1        2000
11  2   lemon   1        2001
12  3   lemon   1        2002

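df2 is similar to df1, except that it has more observations.

Is there a way to compare one part of df2, say orange (df2[df2$name == 'orange', ]), against df1, and then iterate over melon, grapes, and lemon?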
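One possible reading of "compare a part, then iterate" is to split df2 by fruit and apply the same comparison to each piece. The following is only a sketch in base R under that assumption; the helper name compare_to_df1 and the column name match_apple are made up for illustration, and rows are lined up by id rather than by position.

```r
# Sample data from the question
df1 <- data.frame(
  id = 1:3,
  name = "apple",
  score.x = c(1, 3, 2),
  year = c(2000, 2001, 2002)
)
df2 <- data.frame(
  id = rep(1:3, times = 4),
  name = rep(c("orange", "melon", "grapes", "lemon"), each = 3),
  score.x = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
  year = rep(c(2000, 2001, 2002), times = 4)
)

# Split df2 into a list with one data frame per fruit
pieces <- split(df2, df2$name)

# Hypothetical helper: compare one piece with df1, matching rows by id
compare_to_df1 <- function(piece) {
  piece$match_apple <- piece$score.x == df1$score.x[match(piece$id, df1$id)]
  piece
}

# Apply the comparison to every fruit in turn
compared <- lapply(pieces, compare_to_df1)

compared$orange
```

The list can be stacked back into a single data frame with do.call(rbind, compared) if one table is preferred. Because the lookup goes through id, the pieces do not need to have the same number of rows as df1.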

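Finally, I would like to thank anyone who answers this question, and apologize to anyone who finds it poorly asked. I am quite new to both R and Stack Overflow, though I know that is no excuse. In any case, I will try to catch up quickly and produce better content for the community.

Edit: Below is a portion of the actual df I want to apply this to: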

Votesfull
        rcid  ccode  session.x  member  vote  Country  year  date        unres
1       3     2      1          1       1     USA      1946  1946-01-01  R/1/66
2       3     20     1          1       3     CAN      1946  1946-01-01  R/1/66
3       3     31     1          NA      NA    BHS      1946  1946-01-01  R/1/66
4       3     40     1          1       1     CUB      1946  1946-01-01  R/1/66
5       3     41     1          1       1     HTI      1946  1946-01-01  R/1/66
...
512792  2550  2      38         1       3     USA      1983  1983-12-07  R/38/183C
512793  2550  20     38         1       3     CAN      1983  1983-12-07  R/38/183C
512794  2550  31     38         1       2     BHS      1983  1983-12-07  R/38/183C
512795  2550  40     38         1       1     CUB      1983  1983-12-07  R/38/183C
512795  2550  41     38         1       2     HTI      1983  1983-12-07  R/38/183C
...
1041717 5338  2      69         1       3     USA      2014  2014-12-02  R/69/53
1041718 5338  20     69         1       2     CAN      2014  2014-12-02  R/69/53
1041719 5338  31     69         1       1     BHS      2014  2014-12-02  R/69/53
1041720 5338  40     69         1       1     CUB      2014  2014-12-02  R/69/53 
2014721 5338  41     69         1       1     HTI      2014  2014-12-02  R/69/53

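I apologize for the confusion caused by the names in the example dfs above not appearing to overlap.

This data comes from the United Nations General Assembly Voting Data compiled by Dr. Voeten of Georgetown University, available through Harvard Dataverse. The df has overlapping rcid, session.x, and unres (UN resolution code) values, which can be used to line a row up against another country's row.

Edit2: A sketch of the desired result is below (note the match column):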

Votesfull
        rcid  ccode  session.x  member  vote  Country  year  date        unres     match
1       3     2      1          1       1     USA      1946  1946-01-01  R/1/66    TRUE
2       3     20     1          1       3     CAN      1946  1946-01-01  R/1/66    FALSE
3       3     31     1          NA      NA    BHS      1946  1946-01-01  R/1/66    NA
4       3     40     1          1       1     CUB      1946  1946-01-01  R/1/66    TRUE
5       3     41     1          1       1     HTI      1946  1946-01-01  R/1/66    TRUE
...
512792  2550  2      38         1       3     USA      1983  1983-12-07  R/38/183C TRUE
512793  2550  20     38         1       3     CAN      1983  1983-12-07  R/38/183C TRUE
512794  2550  31     38         1       2     BHS      1983  1983-12-07  R/38/183C FALSE
512795  2550  40     38         1       1     CUB      1983  1983-12-07  R/38/183C FALSE
512795  2550  41     38         1       2     HTI      1983  1983-12-07  R/38/183C FALSE
...
1041717 5338  2      69         1       3     USA      2014  2014-12-02  R/69/53   TRUE
1041718 5338  20     69         1       2     CAN      2014  2014-12-02  R/69/53   FALSE
1041719 5338  31     69         1       1     BHS      2014  2014-12-02  R/69/53   FALSE
1041720 5338  40     69         1       1     CUB      2014  2014-12-02  R/69/53   FALSE 
2014721 5338  41     69         1       1     HTI      2014  2014-12-02  R/69/53   FALSE

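I should clarify that the data frames are not exactly the same length.

So, basically, I am trying to test whether the vote in each rcid entry of Votesfull equals the vote in the matching entry of another df with a similar structure. (Each rcid denotes a separate voting session, so there is one rcid entry per vote for each Country.)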
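To make the goal concrete, here is a small sketch with made-up data. It assumes the other df holds one reference vote per resolution (for example, a single country's votes); the names votes_full, reference, vote.full, and vote.ref are hypothetical and do not come from the real data set.

```r
library(dplyr)

# Hypothetical miniature version of Votesfull
votes_full <- data.frame(
  rcid = c(3, 3, 3, 2550, 2550),
  session.x = c(1, 1, 1, 38, 38),
  Country = c("USA", "CAN", "BHS", "USA", "CAN"),
  unres = c("R/1/66", "R/1/66", "R/1/66", "R/38/183C", "R/38/183C"),
  vote = c(1, 3, NA, 3, 3)
)

# Hypothetical reference table: one row per resolution
reference <- data.frame(
  rcid = c(3, 2550),
  session.x = c(1, 38),
  unres = c("R/1/66", "R/38/183C"),
  vote = c(1, 3)
)

# Attach the reference vote to every row with the same keys, then compare
result <- votes_full %>%
  left_join(reference,
            by = c("rcid", "session.x", "unres"),
            suffix = c(".full", ".ref")) %>%
  mutate(match = vote.full == vote.ref)

result
```

The two tables can have different numbers of rows, because the join repeats each reference vote for every row that shares its keys. A row whose vote is NA gets NA in match, as in the sketch of the desired result.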

Edit3: A sketch of the desired result, using the original fruit example:

df1
    id  name    score.x  year
1   1   apple   1        2000
2   2   apple   3        2001
3   3   apple   2        2002

#todo: compare apples to orange, melon, grapes, etc., for each id match
#e.g.) apple(id=1) vs orange(id=1), apple(id=2) vs orange(id=2), so on..

df2
    id  name    score.x  year  match_apple
1   1   orange  2        2000  FALSE       #for id=1, score 2 != 1
2   2   orange  3        2001  TRUE        #for id=2, score 3 == 3
3   3   orange  1        2002  FALSE       #for id=3, score 1 != 2
4   1   melon   1        2000  TRUE
5   2   melon   1        2001  FALSE
6   3   melon   2        2002  TRUE
7   1   grapes  3        2000  FALSE
8   2   grapes  3        2001  TRUE
9   3   grapes  2        2002  TRUE
10  1   lemon   1        2000  TRUE
11  2   lemon   1        2001  FALSE
12  3   lemon   1        2002  FALSE
13  1   berry   1        2000  TRUE        #added new fruit to demo NA
14  2   berry   2        2001  FALSE
15  3   berry   NA       2002  NA          #some values of df are NA

Answer 1

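Based on your comments, you want to test whether the score.x of each df2 entry, for each fruit, equals the corresponding score.x in df1. Here is one way to do that using dplyr and group_by.

I did both an item-by-item test and a comparison of the average scores.

Comparison of average scores: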

library(dplyr)

df2 %>%
  group_by(name) %>%
  summarise(avg = mean(score.x)) %>%
  mutate(match_df1 = avg == mean(df1$score.x))

# A tibble: 4 x 3
  name     avg match_df1
  <fct>  <dbl> <lgl>    
1 grapes  2.67 FALSE    
2 lemon   1    FALSE    
3 melon   1.33 FALSE    
4 orange  2    TRUE

Comparison of each item of each fruit against the corresponding apple entry in df1:

df2 %>%
  group_by(name) %>%
  mutate(match_df1 = score.x == df1$score.x) 

# A tibble: 12 x 5
# Groups:   name [4]
      id name   score.x  year match_df1
   <dbl> <fct>    <dbl> <dbl> <lgl>    
 1     1 orange       2  2000 FALSE    
 2     2 orange       3  2001 TRUE     
 3     3 orange       1  2002 FALSE    
 4     1 melon        1  2000 TRUE     
 5     2 melon        1  2001 FALSE    
 6     3 melon        2  2002 TRUE     
 7     1 grapes       3  2000 FALSE    
 8     2 grapes       3  2001 TRUE     
 9     3 grapes       2  2002 TRUE     
10     1 lemon        1  2000 TRUE     
11     2 lemon        1  2001 FALSE    
12     3 lemon        1  2002 FALSE
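Note that the grouped comparison above lines rows up by position, so it relies on every fruit having exactly as many rows as df1, in the same id order. If that cannot be guaranteed, one alternative (a sketch, not part of the original answer) is to look up the apple score by id with base R's match(). The sample data below is a cut-down, reshuffled variant invented for illustration, including an NA score as in the question's Edit3.

```r
# Apple scores to compare against
df1 <- data.frame(id = 1:3, score.x = c(1, 3, 2))

# Rows deliberately out of id order, plus one NA score
df2 <- data.frame(
  id = c(3, 1, 2, 1, 2, 3),
  name = c("orange", "orange", "orange", "berry", "berry", "berry"),
  score.x = c(1, 2, 3, 1, 2, NA)
)

# For each df2 row, find the df1 row with the same id and take its score
apple_score <- df1$score.x[match(df2$id, df1$id)]

# Element-wise comparison; an NA score yields NA rather than TRUE or FALSE
df2$match_df1 <- df2$score.x == apple_score
df2
```

An id that is missing from df1 would also produce NA here, since match() returns NA when it finds no match.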

Answer 2

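Here is an example of joining the two tables to see whether their score.x values match wherever the other columns match.

left_join takes each row from the first table and outputs one row for every matching row in the second table, where a match is defined by the columns named in the by = c("id", "name", "year") part. Since you now have two versions of score.x, the one from df1 is renamed score.x.x and the one from df2 is renamed score.x.y.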

library(dplyr)
df1 %>%          # (Note, I've modified df1 to be "oranges" so we'll have matches)
  left_join(     # Keep everything in df1 and connect to each matching row in...
    df2,                          # df2, defined by matching...
    by = c("id", "name", "year")  # id, name, and year
  ) %>%
  mutate(match = score.x.x == score.x.y)  # ...and say whether they match

# Here's the output
  id   name score.x.x year score.x.y match
1  1 orange         1 2000         2 FALSE
2  2 orange         3 2001         3  TRUE
3  3 orange         2 2002         1 FALSE
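If the goal is instead to compare every fruit with the apples, as in the question's Edit3, the same join idea can be sketched with name left out of by and with explicit suffixes in place of the default .x and .y. This is an illustrative variant under that assumption, not the answer's original code; the names apples, fruits, and match_apple are made up.

```r
library(dplyr)

# Apple scores (df1 from the question, without the name column)
apples <- data.frame(
  id = 1:3,
  score.x = c(1, 3, 2),
  year = c(2000, 2001, 2002)
)

# Two of the fruits from df2
fruits <- data.frame(
  id = rep(1:3, times = 2),
  name = rep(c("orange", "melon"), each = 3),
  score.x = c(2, 3, 1, 1, 1, 2),
  year = rep(c(2000, 2001, 2002), times = 2)
)

# Each fruit row picks up the apple row with the same id and year;
# suffix controls how the two score.x columns are renamed
joined <- fruits %>%
  left_join(apples, by = c("id", "year"), suffix = c(".fruit", ".apple")) %>%
  mutate(match_apple = score.x.fruit == score.x.apple)

joined
```

Starting the join from the longer table keeps one output row per fruit entry, which is the shape shown in the desired result.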

Sample data, slightly modified from the OP:

df1 <- data.frame(
  "id" = 1:3,
  "name" = c('orange', 'orange', 'orange'),  # Changed to make matches
  "score.x" = c(1, 3, 2),
  "year" = c(2000, 2001, 2002)
)

df2 <- data.frame(
  "id" = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3),
  "name" = c('orange', 'orange', 'orange', 'melon', 'melon', 'melon', 'grapes', 'grapes', 'grapes', 'lemon', 'lemon', 'lemon'),
  "score.x" = c(2, 3, 1, 1, 1, 2, 3, 3, 2, 1, 1, 1),
  "year" = c(2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001, 2002)
)

Is there a way to compare two data frames of different lengths based on matching column values?
