我已经应用了tm
包中的各种清洁功能,例如删除标点符号,数字,特殊字符,常用英语单词等,并获得了如下所示的数据框。请记住,我没有主要密钥,例如cust_id或account_number可以依赖
sno names
001 SIRIS BLACK
002 JOHN DOE
003 STEPHEN HRYY
004 SIRIUS BLACK
005 SIRUS BLACK
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 JHN DOE
看看上面的数据,我真的觉得模式有相似之处,而且这些名字彼此接近。如何使用R的可用文本挖掘函数计算模式相等的百分比,以便最终获得具有所有唯一名称的数据框?
假设和缺点:
直截了当地假设唯一名称可能是具有最大字符的名称,因为我所拥有的原始数据在名称上有大量错别字。 (逻辑假设,也许会减少拼写错误的数量)
agrep()
函数在大字符串中搜索与模式的近似匹配,这里的问题是我实际上不知道模式是什么。
将类似的字符串分组如下:
sno names
001 SIRIS BLACK
002 SIRIUS BLACK
003 SIRUS BLACK
004 JHN DOE
005 JOHN DOE
006 JON DOE
007 STEPHEN HARRY
008 STIPHEN HURRY
009 STEPHEN HRYY
最后得到这个:
001 JOHN DOE
002 STEPHEN HARRY
003 STIPHEN HURRY
004 SIRIUS BLACK
答案 0 :(得分:4)
对于agrep
部分,这是一种方法 - 您可以使用参数来调整结果:
sim <- setNames(lapply(1:nrow(df), function(i) agrep(df$names[i], df$names, max.distance = list(all=2, insertions=2, deletions=2, substitutions=0))), df$names)
sim <- lapply(sim, function(x) unique(df$names[x]))
df$names2 <- sapply(sim, "[", 1)
df[!duplicated(df$names2), ]
# sno names names2
# 1 1 SIRIS BLACK SIRIS BLACK
# 2 2 JOHN DOE JOHN DOE
# 3 3 STEPHEN HRYY STEPHEN HRYY
# 8 8 STIPHEN HURRY STIPHEN HURRY
答案 1 :(得分:0)
这是另一种方法。它使用RecordLinkage包并找到最短形式的有序向量。您可以调整您的阈值水平。
structure(list(sno = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
7L, 8L), .Label = c("JHN", "JOHN", "JON", "SIRIS", "SIRIUS",
"SIRUS", "STEPHEN", "STIPHEN"), class = "factor"), names = structure(c(2L,
2L, 2L, 1L, 1L, 1L, 3L, 4L, 5L), .Label = c("BLACK", "DOE", "HARRY",
"HRYY", "HURRY"), class = "factor"), both.names = c("JHN DOE",
"JOHN DOE", "JON DOE", "SIRIS BLACK", "SIRIUS BLACK", "SIRUS BLACK",
"STEPHEN HARRY", "STEPHEN HRYY", "STIPHEN HURRY")), .Names = c("sno",
"names", "both.names"), row.names = c("009", "002", "006", "001",
"004", "005", "007", "003", "008"), class = "data.frame")
library("RecordLinkage")
compareJW <- function(string, vec, cutoff) {
require(RecordLinkage)
jarowinkler(string, vec) > cutoff
}
shortenFirms <- function(firms, cutoff) {
shortnames <- firms[1]
firms <- firms[-1]
for (firm in firms) {
if (is.na(firm)) { # no firm name, so short-circuit and add an NA
shortnames <- c(shortnames, NA)
next
}
unique.short <- unique(shortnames[!is.na(shortnames)])
hits <- compareJW(firm, unique.short, cutoff)
if (sum(hits) > 1) {
warning(paste("cassifyFirms: more than one match for", firm))
shortnames <- c(shortnames, NA)
} else if (sum(hits) == 0) {
shortnames <- c(shortnames, firm)
} else {
shortnames <- c(shortnames, unique.short[hits])
}
}
shortnames
}
shortenFirms(df$both.names, 0.8)
shortenFirms(df $ both.names,0.8)
[1] "JHN DOE" "JHN DOE" "JHN DOE" "SIRIS BLACK" "SIRIS BLACK" "SIRIS BLACK" "STEPHEN HARRY"
[8] "STEPHEN HARRY" "STEPHEN HARRY"