Cross comparison of data frames in R

Time: 2018-12-03 18:39:13

Tags: r

I have to compare two datasets of people against each other.

Say I have two data frames, each with only a few columns.

a =

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    1   | Mattias Adams   |        M         |    UK
    2   | James Alan      |        M         |    Canada
    3   | Dana Benton     |        F         |    USA
    4   | Ella Collins    |        F         |    USA

b =

    ID  | First_Name | Last_name | Third_name | Whole_name        | Gender
   ————————————————————————————————————————————————————————————————————————————
    1   | Gary       | Cole      | Allan      | Gary Allan Cole   |   M
    2   | Dana       | Benton    | NA         | Dana Benton       |   F
    3   | Lena       | Jamison   | Anne       | Lena Anne Jamison |   F
    4   | Matt       | King      | NA         | Matt King         |   M

Data frame b is the larger one, with about 100,000 rows, while a has fewer than 1,000 rows.

The goal is to use the data in b to find matching records in a. If there is a match, the whole row from a should be returned.

I want to try two approaches. The first is to look for exact matches of b$"Whole_name" in a$"Name".

Exact match:

eue_wn <- as.character(b$"Whole_name")
eue_wn_match <- a[which(as.character(a$"Name") %in% eue_wn),]
if (nrow(eue_wn_match) == 0) {
  eue_wn_match <- "No matches"
}

In this case, the output of eue_wn_match is:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

Pattern match:

eup_ln <- paste(as.character(b$"Last_name"), collapse = "|")
eup_fn <- paste(as.character(b$"First_Name"), collapse = "|")
eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

eup_match <- a[which(grepl(eup_ln, as.character(a$"Name"), ignore.case = TRUE)),] #First filter (last name)
if (nrow(eup_match) == 0) {
  eup_match <- "No matches"
}

if (nrow(eup_match) > 0) {
  eup_match2 <- eup_match[which(grepl(eup_fn, as.character(eup_match$"Name"), ignore.case = TRUE)),] #Second filter (first name)
  if (nrow(eup_match2) == 0 ) {
    eup_match2 <- "No matches"
  }
}

if (nrow(eup_match2) > 0) {
  eup_match3 <- eup_match2[which(grepl(eup_tn, as.character(eup_match2$"Name"), ignore.case = TRUE)),] #Third filter (third_name)
  if (nrow(eup_match3) == 0 ) {
    eup_match3 <- "No matches"
  }
}

So the matching process here has 3 stages. eup_match is the result of matching on the last name. eup_match2 then takes the matches in eup_match and filters them by first name, so it shows the records that meet both criteria. Finally, that result is matched against the third name to give eup_match3.

In this case, the result is the same for all three of them:

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

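That is not correct. Only eup_match and eup_match2 should have that output. In the first stage we matched Dana Benton (a) with Benton (b); in the next stage the match is Dana Benton (a) with Dana (b). And since she has no third name, it should not be possible to match her on a third name. The problem is in: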

eup_tn <- paste(as.character(b$"Third_name"), collapse = "|")

The output of this is:

"Allan|NA|Anne|NA"

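Since NA has been converted to a character string, the function is able to find the pattern in both a and b. In this particular case it matches Dana Benton (a) with NA (b): with ignore.case = TRUE, the pattern NA matches the "na" in "Dana".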
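The false hit can be reproduced in isolation. This is only a minimal sketch; the variable names hit_ignore_case and hit_case_sensitive are made up for illustration and are not part of the code above:

```r
# The collapsed pattern from above; the two "NA" alternatives come from
# missing third names that were turned into text.
eup_tn <- "Allan|NA|Anne|NA"

# Case-insensitive search: "NA" matches the "na" inside "Dana".
hit_ignore_case <- grepl(eup_tn, "Dana Benton", ignore.case = TRUE)

# Case-sensitive search: none of the alternatives occurs in the name.
hit_case_sensitive <- grepl(eup_tn, "Dana Benton")

print(hit_ignore_case)     # TRUE
print(hit_case_sensitive)  # FALSE
```

So the third filter passes for Dana Benton only because of the text "NA" combined with ignore.case = TRUE.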

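Any ideas on how to correct this issue?

The other issue is related to the output. Is there any way to output the results from both a and b?

Example: if we match a$Name against b$First_Name by pattern only, the result would be: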

ID  |     Name        |      Gender      |  Country   | Match | Match ID
———————————————————————————————————————————————————————————————————————————
1   | Mattias Adams   |        M         |    UK      | Matt  |    4 
3   | Dana Benton     |        F         |    USA     | Dana  |    2

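So the first 4 columns come from dataset a and the last two from b. The Match | Match ID columns would show the record in b that was matched.

The desired output for the test example given is: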

    ID  |     Name        |      Gender      |  Country   
   —————————————————————————————————————————————————————————— 
    3   | Dana Benton     |        F         |    USA

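Sorry for the long post. I tried to make it as clear as possible. If anyone wants to recreate it, the xlsx files a and b and the R code can be found here: MyDropbox

If anyone has other suggestions on how to approach this topic, they are welcome. Thanks for your help.

2 answers:

Answer 0: (score: 1)

Approach 1: Exact match

Why not something like this: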

library(stringr)
library(dplyr)
a <- a %>%
    # Extract first and last names into new variables
    mutate(First_Name = str_extract(Name, "^[A-Za-z]+"),
           Last_Name = str_extract(Name, "[A-Za-z]+$"))

# Inner Join by first and last name.
# Add a suffix to be able to distinguish the origin of columns.
b %>% inner_join(a, by = c("First_Name", "Last_Name"), suffix = c(".b", ".a")) %>%
    # Select the columns you want to see.
    # Note that only the columns that have an ambiguous name have a suffix.
    select(ID.a, Name, Gender.a, Country, First_Name, Last_Name, ID.b)

This works well if you are only looking for exact matches. If you like, you can also extract a middle name from the string with str_extract(string, "[^A-Za-z]+[A-Za-z]+[^A-Za-z$]").

Result:
  ID.a        Name Gender.a Country First_Name Last_Name ID.b
1    3 Dana Benton        F     USA       Dana    Benton    2

Approach 2: String distance (Jaro-Winkler)

Expanding on this great post:

library(RecordLinkage)
library(dplyr)

lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = jarowinkler(target, source)) %>%
    summarise(match = match_score[which.max(match_score)], matched_to = source[which.max(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

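Any value above .8 or .9 should be a pretty good match. It is still not perfect. If your data is clean, you could try matching first and last names separately.

Result: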
# A tibble: 4 x 8
  target        match matched_to         ID First_Name Last_Name Third_Name Gender
  <chr>         <dbl> <chr>           <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton   1     Dana Benton         2 Dana       Benton    NA         F     
2 Ella Collins  0.593 Matt King           4 Matt       King      NA         M     
3 James Alan    0.667 Gary Allan Cole     1 Gary       Cole      Allan      M     
4 Mattias Adams 0.792 Matt King           4 Matt       King      NA         M     
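To turn the .8/.9 rule of thumb into an actual filter, the best-match table can be cut at a threshold. The sketch below is base R only and simply retypes the scores printed above; the names scores, threshold and accepted are hypothetical, and the cut-off itself is an assumption that would need tuning on real data:

```r
# Scores copied from the result table above (rounded as printed there).
scores <- data.frame(
  target     = c("Dana Benton", "Ella Collins", "James Alan", "Mattias Adams"),
  match      = c(1, 0.593, 0.667, 0.792),
  matched_to = c("Dana Benton", "Matt King", "Gary Allan Cole", "Matt King"),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off; .8 or .9 is only a rule of thumb.
threshold <- 0.9

# Keep only the pairs whose similarity reaches the cut-off.
accepted <- scores[scores$match >= threshold, ]
print(accepted)
```

With this data only Dana Benton survives. Mattias Adams / Matt King at 0.792 falls just below .8, which shows how sensitive the outcome is to the chosen cut-off.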


Approach 3: String distance (Levenshtein)

Same as above, only using the Levenshtein distance and which.min()

library(RecordLinkage)
library(dplyr)


lookup <- expand.grid(target = a$Name, source = b$Whole_Name, stringsAsFactors = FALSE)

lookup %>% group_by(target) %>%
    mutate(match_score = levenshteinDist(target, source)) %>%
    summarise(match = match_score[which.min(match_score)], matched_to = source[which.min(match_score)]) %>%
    inner_join(b, c("matched_to" = "Whole_Name"))

As expected, this gives worse matching results than JW.

Result:
# A tibble: 4 x 8
  target        match matched_to     ID First_Name Last_Name Third_Name Gender
  <chr>         <int> <chr>       <dbl> <chr>      <chr>     <chr>      <chr> 
1 Dana Benton       0 Dana Benton     2 Dana       Benton    NA         F     
2 Ella Collins      9 Dana Benton     2 Dana       Benton    NA         F     
3 James Alan        8 Matt King       4 Matt       King      NA         M     
4 Mattias Adams     8 Matt King       4 Matt       King      NA         M     
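If installing RecordLinkage is not an option, a similar lookup can be sketched with adist() from base R (package utils), which computes a generalized Levenshtein distance. This is an alternative I am swapping in, not the code that produced the table above, and on ties it may pick a different closest name. The vectors a_names and b_names are stand-ins for a$Name and b$Whole_Name:

```r
# Small vectors mirroring a$Name and b$Whole_Name from the Data section.
a_names <- c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins")
b_names <- c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King")

# adist() returns a matrix of edit distances: rows = a_names, columns = b_names.
d <- adist(a_names, b_names)

# For each name in a, take the column index of the smallest distance
# (which.min() returns the first one on ties).
best <- apply(d, 1, which.min)

closest <- data.frame(
  target     = a_names,
  matched_to = b_names[best],
  distance   = d[cbind(seq_along(a_names), best)],
  stringsAsFactors = FALSE
)
print(closest)
```

Note that the distance matrix has one cell per pair of names, so for 1,000 x 100,000 rows it would probably have to be computed in chunks.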


Data

a <- structure(list(ID = c(1, 2, 3, 4), Name = c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins"), Gender = c("M", "M", "F", "F"), Country = c("UK", "Canada", "USA", "USA")), .Names = c("ID", "Name", "Gender", "Country"), row.names = c(NA, -4L), class = "data.frame")
b <- structure(list(ID = c(1, 2, 3, 4), First_Name = c("Gary", "Dana", "Lena", "Matt"), Last_name = c("Cole", "Benton", "Jamison", "King"), Third_Name = c("Allan", "NA", "Anne", "NA"), Whole_name = c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King"), Gender = c("M", "F", "F", "M")), .Names = c("ID", "First_Name", "Last_Name", "Third_Name", "Whole_Name", "Gender"), row.names = c(NA, -4L), class = "data.frame")

Answer 1: (score: 0)

If you want to avoid false matches with NA, don't include it in the pattern. Use this instead:

eup_tn <- paste(na.omit(as.character(b$"Third_name")), collapse = "|")
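To see the effect, here is a small self-contained sketch. It assumes the third-name column contains real missing values (NA); third_name and name_in_a are made-up stand-ins for b$"Third_name" and one entry of a$"Name":

```r
# Hypothetical third-name column with real missing values (NA),
# plus the name from a that was wrongly matched in the question.
third_name <- c("Allan", NA, "Anne", NA)
name_in_a  <- "Dana Benton"

# Without na.omit(): paste() turns NA into the text "NA".
pattern_with_na <- paste(as.character(third_name), collapse = "|")

# With na.omit(): missing values are dropped before collapsing.
pattern_clean <- paste(na.omit(as.character(third_name)), collapse = "|")

print(pattern_with_na)  # "Allan|NA|Anne|NA"
print(pattern_clean)    # "Allan|Anne"

false_hit <- grepl(pattern_with_na, name_in_a, ignore.case = TRUE)
fixed_hit <- grepl(pattern_clean, name_in_a, ignore.case = TRUE)
print(false_hit)  # TRUE
print(fixed_hit)  # FALSE
```

One caveat: if the column holds the literal string "NA" rather than a missing value (as in the Data section of the other answer), na.omit() will not drop it. In that case the strings would have to be filtered out first, for example with third_name[third_name != "NA"].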

Regarding your second question: this is done with the merge() function in base R, or with one of its replacements in dplyr (see ?dplyr::join), probably inner_join().
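As a rough sketch of both ideas in base R: merge() handles the exact join, and a small loop over grep() produces the Match | Match ID layout from the question for pattern matches. The data frames below are cut-down copies of the example data, and the names both, pairs, Match and Match_ID are my own choices, not something merge() creates for you:

```r
a <- data.frame(
  ID = 1:4,
  Name = c("Mattias Adams", "James Alan", "Dana Benton", "Ella Collins"),
  Gender = c("M", "M", "F", "F"),
  Country = c("UK", "Canada", "USA", "USA"),
  stringsAsFactors = FALSE
)
b <- data.frame(
  ID = 1:4,
  First_Name = c("Gary", "Dana", "Lena", "Matt"),
  Whole_name = c("Gary Allan Cole", "Dana Benton", "Lena Anne Jamison", "Matt King"),
  stringsAsFactors = FALSE
)

# Exact join of a$Name against b$Whole_name. The ID column exists in both
# inputs, so it is returned twice with the suffixes given here.
both <- merge(a, b, by.x = "Name", by.y = "Whole_name", suffixes = c(".a", ".b"))
print(both)

# Pattern join: for every first name in b, find the rows of a whose Name
# contains it, and attach the matching value and ID from b.
pairs <- do.call(rbind, lapply(seq_len(nrow(b)), function(i) {
  idx <- grep(b$First_Name[i], a$Name, ignore.case = TRUE)
  if (length(idx) == 0) return(NULL)  # no row of a contains this first name
  data.frame(a[idx, ], Match = b$First_Name[i], Match_ID = b$ID[i],
             stringsAsFactors = FALSE)
}))
print(pairs)
```

The first names are used as regular expressions here, so names containing special characters would need fixed = TRUE (without ignore.case) or escaping.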