Question

我已经应用了tm包中的各种清洁功能，例如删除标点符号，数字，特殊字符，常用英语单词等，并获得了如下所示的数据框。请记住，我没有主要密钥，例如cust_id或account_number可以依赖

sno        names
001        SIRIS BLACK
002        JOHN DOE
003        STEPHEN HRYY
004        SIRIUS BLACK
005        SIRUS BLACK
006        JON DOE
007        STEPHEN HARRY
008        STIPHEN HURRY
009        JHN DOE

看看上面的数据，我真的觉得模式有相似之处，而且这些名字彼此接近。如何使用R的可用文本挖掘函数计算模式相等的百分比，以便最终获得具有所有唯一名称的数据框？

假设和缺点：

直截了当地假设唯一名称可能是具有最大字符的名称，因为我所拥有的原始数据在名称上有大量错别字。（逻辑假设，也许会减少拼写错误的数量）
agrep()函数在大字符串中搜索与模式的近似匹配，这里的问题是我实际上不知道模式是什么。

将类似的字符串分组如下：

sno        names
001        SIRIS BLACK          
002        SIRIUS BLACK
003        SIRUS BLACK
004        JHN DOE
005        JOHN DOE
006        JON DOE
007        STEPHEN HARRY
008        STIPHEN HURRY
009        STEPHEN HRYY

最后得到这个：

001     JOHN DOE
002     STEPHEN HARRY
003     STIPHEN HURRY
004     SIRIUS BLACK

Answer 1

对于agrep部分，这是一种方法 - 您可以使用参数来调整结果：

sim <- setNames(lapply(1:nrow(df), function(i) agrep(df$names[i], df$names, max.distance = list(all=2, insertions=2, deletions=2, substitutions=0))), df$names)
sim <- lapply(sim, function(x) unique(df$names[x]))
df$names2 <- sapply(sim, "[", 1)
df[!duplicated(df$names2), ]
#   sno         names        names2
# 1   1   SIRIS BLACK   SIRIS BLACK
# 2   2      JOHN DOE      JOHN DOE
# 3   3  STEPHEN HRYY  STEPHEN HRYY
# 8   8 STIPHEN HURRY STIPHEN HURRY

Answer 2

这是另一种方法。它使用RecordLinkage包并找到最短形式的有序向量。您可以调整您的阈值水平。

structure(list(sno = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
7L, 8L), .Label = c("JHN", "JOHN", "JON", "SIRIS", "SIRIUS", 
"SIRUS", "STEPHEN", "STIPHEN"), class = "factor"), names = structure(c(2L, 
2L, 2L, 1L, 1L, 1L, 3L, 4L, 5L), .Label = c("BLACK", "DOE", "HARRY", 
"HRYY", "HURRY"), class = "factor"), both.names = c("JHN DOE", 
"JOHN DOE", "JON DOE", "SIRIS BLACK", "SIRIUS BLACK", "SIRUS BLACK", 
"STEPHEN HARRY", "STEPHEN HRYY", "STIPHEN HURRY")), .Names = c("sno", 
"names", "both.names"), row.names = c("009", "002", "006", "001", 
"004", "005", "007", "003", "008"), class = "data.frame")

library("RecordLinkage")
compareJW <- function(string, vec, cutoff) {
  require(RecordLinkage)
  jarowinkler(string, vec) > cutoff
}

shortenFirms <- function(firms, cutoff) {
  shortnames <- firms[1]
  firms <- firms[-1]

  for (firm in firms) {
    if (is.na(firm)) { # no firm name, so short-circuit and add an NA
      shortnames <- c(shortnames, NA)
      next

    }
    unique.short <- unique(shortnames[!is.na(shortnames)])
    hits <- compareJW(firm, unique.short, cutoff)
    if (sum(hits) > 1) {
      warning(paste("cassifyFirms: more than one match for", firm))
      shortnames <- c(shortnames, NA)
    } else if (sum(hits) == 0) {
      shortnames <- c(shortnames, firm)
    } else {
      shortnames <- c(shortnames, unique.short[hits])
    }
  }
  shortnames
}

shortenFirms(df$both.names, 0.8)

shortenFirms（df $ both.names，0.8）

[1] "JHN DOE"       "JHN DOE"       "JHN DOE"       "SIRIS BLACK"   "SIRIS BLACK"   "SIRIS BLACK"   "STEPHEN HARRY"
[8] "STEPHEN HARRY" "STEPHEN HARRY"

R文本挖掘：从数据框中对类似模式进行分组。

2 个答案: