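Deduplicating a table based on values in 2 columns + fuzzy matching

Date: 2019-02-26 23:06:35

Tags: r duplicates record-linkage

I have a CSV file exported from Zotero that contains the metadata of my library entries. I know it contains a lot of duplicates, but removing them is not straightforward:

  • Not all items with similar titles are actually duplicates, e.g.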

    | Year |            Author             |    Title     |
    +------+-------------------------------+--------------+
    | 2016 | Jones, Erik                   | Book Reviews |
    | 2016 | Hassner, Pierre; Jones, Erik  | Book Reviews |
    | 2010 | Adams, Laura L.; Gagnon, Chip | Book Reviews |
    
  • Not all items that actually are duplicates have 100% identical metadata strings, e.g.

    |    Author     |                     Title                     |
    +---------------+-----------------------------------------------+
    | Tichý, Lukáš; | Can Iran Reduce EU Dependence on Russian Gas? |
    | Tichy, L.;    | "can iran reduce eu dependence onrussian gas" |
    

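This is an extreme example (usually the differences are not this large), but as you can see, pre-cleaning cannot fully solve the problem. So the idea is to eliminate rows that contain similar values in two or more columns, e.g. "Author" and "Title".

What I have tried / looked at so far:

  • OpenRefine: I am barely familiar with it, so I could not come up with or find any workable approach.
  • Excel fuzzy lookup extension: does not really work the way I need.
  • Python: again, I am not comfortable with the language; I could not find any relevant solutions/guides.
  • R: tried out a few ideas:

First, use agrep in a for loop over the "Author" column to get the indices of rows that have duplicates; then do the same for the "Title" column; then compare the vectors and deduplicate the rows where the values coincide. Needless to say, I could not get past the first step: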

titles <- unlist(corpus$"Title")
for (i in 1:length(titles)){
  Title_dupe_temp <- agrep(titles[i], titles[i+1:length(titles)], 
                           max.distance = 1, ignore.case = TRUE, fixed = FALSE)
  Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ")
}

The result is (almost) complete gibberish; on top of that I get a warning message:

In Title_dupes[i] <- paste(i, Title_dupe_temp, sep = " ") :
  number of items to replace is not a multiple of replacement length

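I also read the fuzzywuzzyR documentation, but did not find any function that would help.

Finally, I tried the RecordLinkage package. Again, I could not get past the basics. The documentation is rather heavy and ambiguous about pretty much everything; guides are scarce, and the ones I found (e.g. this) use sample datasets that already come with identity vectors, so I could not figure out how to replicate them on my data.

So at this point I do not care whether it is done in OpenRefine / R / Py / SQL / whatever; any approach will do.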

2 answers:

Answer 0: (score: 1)

Solution I: using a loop and the stringdist library

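library(stringdist)

zotero <- data.frame(
  Year = c(2016, 2016, 2010, 2010, 2010, 2010),
  Author = c("Jones, Erik", "Hassner, Pierre;", "Adams, Laura L.;", "Tichý, Lukáš;", "Tichý, Lukáš;", "Tichy, L.;"),
  Title = c("Book Reviews", "Book Reviews", "Book Reviews", "Can Iran Reduce EU Dependence on Russian Gas?", "Can Iran Reduce EU Dependence on Russian Gas?", "can iran reduce eu dependence onrussian gas")
)

# Concatenate the fields into one comparison string per row
zotero$onestring <- paste0(zotero$Year, zotero$Author, zotero$Title)
# Sort by year, then author, so that likely duplicates end up on adjacent rows
zotero <- zotero[order(zotero[, 1], zotero[, 2]), ]

# Distance between each row and the previous one, scaled by their combined length
atot <- NULL
for (i in 2:dim(zotero)[1]) {
  a <- stringdist(zotero$onestring[i - 1], zotero$onestring[i]) / (nchar(zotero$onestring[i - 1]) + nchar(zotero$onestring[i]))
  atot <- rbind(atot, a)
}

# The first row has no predecessor, so it gets 1 and is always kept
zotero <- cbind(zotero, threshold = c(1, atot))
# Keep only the rows that differ enough from the row above them
zotero[zotero$threshold > 0.15, ]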

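Solution II: a matrix computation is faster than a loop. First I create a data frame from your sample data, then I strip the invalid (non-UTF-8) characters, and finally I use the stringdist library to compute a distance matrix, which you can easily convert to similarity percentages.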

zotero<-data.frame(
  Year=c(2016,2016,2010,2010,2010,2010),
  Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
  Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
)

zotero$onestring<-paste0(zotero$Year,zotero$Author,zotero$Title)

Encoding(zotero$onestring) <- "UTF-8"
zotero$onestring<-iconv(zotero$onestring, "UTF-8", "UTF-8",sub='')

library(stringdist)
stringdistmatrix(zotero$onestring)

Result:

> stringdistmatrix(zotero$onestring)
   1  2  3  4  5
2 11            
3 13 14         
4 47 45 44      
5 47 45 44  0   
6 47 45 42 13 13
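
As a rough sketch of the conversion to similarity percentages mentioned above: one common choice is to divide each distance by the length of the longer of the two strings. The sample strings below are hypothetical ASCII stand-ins for the onestring column, and base R's adist() (plain Levenshtein distance) is used here instead of stringdistmatrix() only so that the snippet runs without extra packages; the exact numbers can differ slightly from stringdist's default method.

```r
# Hypothetical stand-ins for zotero$onestring (year + author + title)
onestring <- c(
  "2010Tichy, Lukas;Can Iran Reduce EU Dependence on Russian Gas?",
  "2010Tichy, Lukas;Can Iran Reduce EU Dependence on Russian Gas?",
  "2010Tichy, L.;can iran reduce eu dependence onrussian gas",
  "2016Jones, ErikBook Reviews"
)

# Full pairwise Levenshtein distance matrix (base R, utils::adist)
d <- adist(onestring)

# Length of the longer string in every pair
len <- nchar(onestring)
max_len <- outer(len, len, pmax)

# 100 = identical strings; lower values = less similar
sim <- 100 * (1 - d / max_len)
round(sim, 1)
```

A cutoff on sim (for instance, treating anything above some percentage as a candidate duplicate) is a judgment call and should be tuned on the real data.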

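Answer 1: (score: 1)

I used an approach similar to @Nakx's, and I like the matrix solution. However, you could also clean things up a bit more with gsub and iconv, and do the matching with sapply (taking index [2] of the sorted distances gives the best match that is not the string itself, which is always 0). Something like this: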

> library(RecordLinkage)
> 
> zotero<-data.frame(
+   Year=c(2016,2016,2010,2010,2010,2010),
+   Author=c("Jones, Erik","Hassner, Pierre;","Adams, Laura L.;","Tichý, Lukáš;","Tichý, Lukáš;","Tichy, L.;"),
+   Title=c("Book Reviews","Book Reviews","Book Reviews","Can Iran Reduce EU Dependence on Russian Gas?","Can Iran Reduce EU Dependence on Russian Gas?","can iran reduce eu dependence onrussian gas")
+ )
> 
> # Transliterating special characters, lowercasing, and removing punctuation in Author
> zotero$Author_new <- iconv(zotero$Author, from = '', to = "ASCII//TRANSLIT")
> zotero$Author_new <- tolower(zotero$Author_new)
> zotero$Author_new <- gsub("[[:punct:]]", "", zotero$Author_new)
> 
> # Removing punctuation from Title and making it lowercase
> zotero$Title_new <- gsub("[[:punct:]]", "", zotero$Title)
> zotero$Title_new <- tolower(zotero$Title_new)
> 
> # Creating a distance measure for your title and author:
> # distance to the nearest other value (element [1] is the 0 self-match)
> zotero$Title_dist <- sapply(zotero$Title_new, function(x) sort(levenshteinDist(x, zotero$Title_new))[2])
> zotero$Author_dist <- sapply(zotero$Author_new, function(x) sort(levenshteinDist(x, zotero$Author_new))[2])
>
> # Removing exact duplicates
> dups <- duplicated(zotero[,c("Title_new", "Author_new", "Year")])
> zotero <- zotero[!dups,]
> zotero
  Year           Author                                         Title     Author_new
1 2016      Jones, Erik                                  Book Reviews     jones erik
2 2016 Hassner, Pierre;                                  Book Reviews hassner pierre
3 2010 Adams, Laura L.;                                  Book Reviews  adams laura l
4 2010    Tichý, Lukáš; Can Iran Reduce EU Dependence on Russian Gas?    tichy lukas
6 2010       Tichy, L.;   can iran reduce eu dependence onrussian gas        tichy l
                                     Title_new Title_dist Author_dist
1                                 book reviews          0           9
2                                 book reviews          0           9
3                                 book reviews          0           9
4 can iran reduce eu dependence on russian gas          0           0
6  can iran reduce eu dependence onrussian gas          1           4
>
> # Filter here

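From there, you can use the distance variables to create conditions and filter. For example, you might be comfortable treating an article as a duplicate if it is within an author distance of 2 and a title distance of 5.

Edited to clarify the filtering example. You will need to adjust it after looking at your data. Always start conservatively.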

> library(dplyr)
> zotero <- zotero %>%
+   group_by(Year) %>%
+   filter(!between(Title_dist, 1, 5) | 
+          !between(Author_dist, 1, 5))
> zotero
# A tibble: 4 x 7
# Groups:   Year [2]
   Year Author       Title                     Author_new   Title_new                   Title_dist Author_dist
  <dbl> <fct>        <fct>                     <chr>        <chr>                            <int>       <int>
1  2016 Jones, Erik  Book Reviews              jones erik   book reviews                         0           9
2  2016 Hassner, Pi~ Book Reviews              hassner pie~ book reviews                         0           9
3  2010 Adams, Laur~ Book Reviews              adams laura~ book reviews                         0           9
4  2010 Tichý, Luká~ Can Iran Reduce EU Depen~ tichy lukas  can iran reduce eu depende~          0           0
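
Note that Title_dist and Author_dist each measure the distance to the nearest other row, which need not be the same row for both columns. If that matters for your data, a possible variation is to require that the same pair of rows is close on both columns. The following is only a sketch under stated assumptions: the vectors are hypothetical stand-ins for the cleaned Author_new and Title_new columns, the thresholds of 5 are illustrative, and base R's adist() is used in place of levenshteinDist() so that the snippet runs without extra packages.

```r
# Hypothetical stand-ins for the cleaned Author_new / Title_new columns
author <- c("jones erik", "hassner pierre", "adams laura l",
            "tichy lukas", "tichy l")
title  <- c("book reviews", "book reviews", "book reviews",
            "can iran reduce eu dependence on russian gas",
            "can iran reduce eu dependence onrussian gas")

# Pairwise Levenshtein distance matrices, one per column
title_d  <- adist(title)
author_d <- adist(author)

# A pair is a candidate duplicate only if BOTH columns are close
# (thresholds are illustrative assumptions; tune them on your data)
close <- title_d <= 5 & author_d <= 5

# Keep each pair once (row index > column index) and drop self-matches
close[upper.tri(close, diag = TRUE)] <- FALSE

# Rows that are close to some earlier row are treated as duplicates
pairs <- which(close, arr.ind = TRUE)
dupe_rows <- unique(pairs[, "row"])
keep <- setdiff(seq_along(title), dupe_rows)

data.frame(author = author[keep], title = title[keep])
```

With this rule the three "book reviews" rows survive because their authors are far apart, while the later of two rows that match on both author and title is dropped. Which row of a matching pair to keep (here simply the earlier one) is also an assumption you may want to change.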