Question

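I've found variants of this question, but I can't get the suggested solutions to work in my situation. I'm new to R and have no other coding experience, so I may just be missing something basic. Thanks for any help!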

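I have a data table with a column of organisation names, call it Orgs$OrgName. Sometimes the strings that make up the organisation names contain misspelt words. I have a lookup table (imported from a CSV) with common misspellings in one column (spelling$misspelt) and the corrections in another (spelling$correct).

I want to find any part of an OrgName string that matches spelling$misspelt and replace that part with spelling$correct.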

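I've tried various solutions based on mgsub, stri_replace_all_fixed and str_replace_all ("replacement of words in strings" has been my main reference). Nothing has worked, and all the examples seem to be based on manually created vectors such as vect1 <- c("item1", "item2", "item3") rather than on a lookup table.

An example of my data: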

                                         OrgName
1:                         WAIROA DISTRICT COUNCIL
2:                         MANUTAI MARAE COMMITTEE
3:                                C S AUTOTECH LTD
4:                  NEW ZEALAND INSTITUTE OF SPORT
5:                                 BRAUHAUS FRINGS
6:   CHRISTCHURCH YOUNG MENS CHRISTIAN ASSOCIATION

The lookup table:

    mispelt         correct 
1 ABANDONNED       ABANDONED            
2  ABERATION      ABERRATION            
3  ABILITYES       ABILITIES            
4   ABILTIES       ABILITIES            
5     ABILTY         ABILITY            
6    ABONDON         ABANDON

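(The first few rows of organisation names have no spelling mistakes, but there are more than 57,000 rows in the dataset.)

UPDATE: This is what I've tried based on the update to the second response (I tried that one first because it's simpler). It hasn't worked, but hopefully someone can see where it's going wrong?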

library("stringi")
library("data.table")  # needed for data.table() and %like% further down
Orgs <- data.frame(OrgNameClean$OrgNameClean)
head(Orgs)
head(OrgNameClean)

write.csv(spelling$mispelt,file = "wrong.csv")
write.csv(spelling$correctspelling,file = "corrected.csv")

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")
head(patterns)
head(replacements)

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  #print(stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE))
  row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% 
"\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

head(Orgs)
Orgsdt <- data.table(Orgs)
head(Orgsdt)
chckspellchk <- Orgsdt[OrgNameClean.OrgNameClean %like% "ENVIORNMENT",,] 
##should return no rows if spelling correction worked
head(chckspellchk)

#OrgNameClean.OrgNameClean
#1:   SMART ENVIORNMENTAL LTD

UPDATE 2: More info - there are spaces in the spelling lookup, in case that makes a difference:

> head(spelling[mispelt %like% " ",,])
     mispelt correctspelling 
1: COCA COLA            COCA            
2:   TORTISE        TORTOISE      

> head(spelling[correctspelling %like% " "])
    mispelt correctspelling  
1:   ABOUTA         ABOUT A             
2:  ABOUTIT        ABOUT IT             
3: ABOUTTHE       ABOUT THE             
4:     ALOT           A LOT       
5: ANYOTHER       ANY OTHER             
6:    ASFAR          AS FAR

Answer 1

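This answer may be too complicated for a new programmer, and I may be writing this more like Python than R (I'm a bit rusty on the latter)*, but I have a proposed solution for your problem, which, by the way, is not trivial. The issue I foresee with the other solutions you've come across is that they only address a small part of the larger puzzle, which is that you need to be able to check every word in every string against your lookup table. I think the simplest way to do this is to write a few small functions that do what you need, and then use R's apply family of functions to loop over the entries and call those functions.

The other tricky part is using an R environment as the lookup table. For whatever reason, R people don't seem to talk about or really use hash tables (the real name for a lookup table) very much, but they are very common in other languages. Fortunately, environments are really just an implementation of C hash tables under the hood, which is nice because hashes are very fast and let you map one value directly to another. (More on this here, if interested.)

* I welcome comments or edits from others that would make my answer simpler or more R-idiomatic.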

# Some example data - note stringsAsFactors=FALSE is critical for this to work
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 'C S AUTOTECH LTD', 
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

spelling_df <- data.frame("Mistake" = c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW'),
                          "Correct"= c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD'),
                       stringsAsFactors = FALSE)


# Function to convert your data frame to a hash table
create_hash <- function(in_df){
  hash_table <- new.env(hash=TRUE)
  for(i in seq_len(nrow(in_df)))  # seq_len() is safe when in_df has zero rows
  {
    hash_table[[in_df[i, 1]]] <- in_df[i, 2]
  }
  return(hash_table)
}

# Make the hash table out of your data frame
spelling_hash <- create_hash(spelling_df)

# Try it out:
print(spelling_hash[['ABANDONNED']])  # prints ABANDONED

# Now make a function to apply the lookup - and ensure
# if the string is not in the lookup table, you return the 
# original string instead (instead of NULL)
apply_hash <- function(in_string, hash_table=spelling_hash){
  x = hash_table[[in_string]]
  if(!is.null(x)){
    return(x)
  }
  else{
    return(in_string)
  }
}

# Finally make a function to break the full company name apart, 
# apply the lookup on each word, and then paste it back together
correct_spelling <- function(bad_string) {
  split_string <- strsplit(as.character(bad_string), " ")
  new_split <- lapply(split_string[[1]], apply_hash)
  return(paste(new_split, collapse=' '))
}

# Make a new field that applies the spelling correction
Orgs$Corrected <- sapply(Orgs$OrgName, correct_spelling)
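As one possible, more R-idiomatic variant of the same word-by-word idea, a named character vector can play the role of the lookup table instead of an environment. This is only a sketch under assumptions: the names `org_names`, `lookup` and `correct_words` are made up for illustration, and the data mirrors the example above.

```r
# Hypothetical example data, mirroring the answer's sample
org_names <- c("WAIROA ABANDONNED COUNCIL",
               "C S AUTOTECH LTD",
               "NEW ZEALAND INSTITUTE OF ABERATION ABILITYES")

# Named vector: names are the misspellings, values are the corrections
lookup <- c(ABANDONNED = "ABANDONED",
            ABERATION  = "ABERRATION",
            ABILITYES  = "ABILITIES",
            NEW        = "OLD")

# Split one name into words, swap any word found in the lookup,
# and paste the words back together
correct_words <- function(x, lookup) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  hit <- words %in% names(lookup)
  words[hit] <- unname(lookup[words[hit]])
  paste(words, collapse = " ")
}

corrected <- vapply(org_names, correct_words, character(1),
                    lookup = lookup, USE.NAMES = FALSE)
print(corrected)
```

Because this splits on single spaces and compares whole words, it would not catch multi-word misspellings such as "COCA COLA", nor a misspelling embedded in a longer word such as ENVIORNMENT inside ENVIORNMENTAL.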

Answer 2

We can use stringi's stri_replace_all_*() functions to do multiple replacements across a whole string.

library("stringi")
string <- "WAIROA ABANDONNED COUNCIL','C S AUTOTECH LTD', 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES"
# Named to match the calls below
patterns <- c('ABANDONNED', 'ABERATION', 'ABILITYES', 'NEW')
replacements <- c('ABANDONED', 'ABERRATION', 'ABILITIES', 'OLD')

stri_replace_all_fixed(string, patterns, replacements, vectorize_all=FALSE)
stri_replace_all_regex(string, "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)

Output:

[1] "WAIROA ABANDONED COUNCIL','C S AUTOTECH LTD', 'OLD ZEALAND INSTITUTE OF ABERRATION ABILITIES"

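Some notes:

stri_replace_all_fixed replaces occurrences of a fixed pattern match.
stri_replace_all_regex uses a regular expression pattern. This lets us specify word boundaries with \b to avoid substring matches (an alternative to \bword\b is (?<=\W)word(?=\W)).
vectorize_all is set to FALSE; otherwise each replacement is applied to a new copy of the original sentence. See here for details.

Full sample: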
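The notes above can be illustrated with a small sketch. The input sentence and the two pattern/replacement pairs are made up for the purpose; only the stringi calls already used in this answer are involved.

```r
library(stringi)

sentence <- "SMART ENVIORNMENTAL ABILTY LTD"   # hypothetical input
patterns <- c("ENVIORNMENT", "ABILTY")
replacements <- c("ENVIRONMENT", "ABILITY")

# vectorize_all = FALSE: every pair is applied in turn to the same string,
# giving one fully corrected result
sequential <- stri_replace_all_fixed(sentence, patterns, replacements,
                                     vectorize_all = FALSE)

# vectorize_all = TRUE (the default): each pair is applied to its own copy
# of the sentence, giving one partially corrected result per pair
per_pair <- stri_replace_all_fixed(sentence, patterns, replacements,
                                   vectorize_all = TRUE)

# With word boundaries, ENVIORNMENT inside ENVIORNMENTAL is not matched,
# so only the whole word ABILTY is replaced
bounded <- stri_replace_all_regex(sentence,
                                  "\\b" %s+% patterns %s+% "\\b",
                                  replacements, vectorize_all = FALSE)

print(sequential)
print(per_pair)
print(bounded)
```

Which variant is appropriate depends on whether the lookup entries are meant to match whole words only or also parts of longer words.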

library("stringi")
Orgs <- data.frame("OrgName" = c('WAIROA ABANDONNED COUNCIL', 
                                 ' SMART ENVIORNMENTAL LTD',
                                 'NEW ZEALAND INSTITUTE OF ABERATION ABILITYES'),
                   stringsAsFactors = FALSE)

patterns <- readLines("wrong.csv")
replacements <- readLines("corrected.csv")

for(i in 1:nrow(Orgs)) {
  row <- Orgs[i,]
  print(as.character(row))
  row <- stri_replace_all_fixed(row, patterns, replacements, vectorize_all=FALSE)
  #row <- stri_replace_all_regex(as.character(row), "\\b" %s+% patterns %s+% "\\b", replacements, vectorize_all=FALSE)
  print(row)
  Orgs[i,] <- row
}

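PS: I created a separate CSV for each character vector, each containing a single column with no header. But there are many other ways to read a CSV in R and convert its columns to character vectors.

PS2: If you want substring matches, e.g. matching ENVIORNMENT inside ENVIORNMENTAL, don't use stri_replace_all_regex() with the word boundary \b. This is a nice tutorial for improving your regex skills.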
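One of those other ways is to skip the intermediate CSV files and take the two columns straight from the lookup data frame. The sketch below assumes a small stand-in `spelling` table with the column names from the question's update; `tmp`, `raw_lines`, `org_names` and `cleaned` are illustrative names. It also shows why a round trip through write.csv() and readLines() may not give bare words: by default write.csv() writes a header row, row names and quotes.

```r
library(stringi)

# Hypothetical stand-in for the asker's lookup table
spelling <- data.frame(
  mispelt = c("ABANDONNED", "ENVIORNMENT"),
  correctspelling = c("ABANDONED", "ENVIRONMENT"),
  stringsAsFactors = FALSE
)

# Round trip through a CSV: readLines() returns the raw file lines,
# including the header line and the quoted row names
tmp <- tempfile(fileext = ".csv")
write.csv(spelling$mispelt, file = tmp)
raw_lines <- readLines(tmp)
print(raw_lines)

# Taking the columns directly from the data frame avoids that problem
patterns <- as.character(spelling$mispelt)
replacements <- as.character(spelling$correctspelling)

# The replacement functions accept a whole vector of strings,
# so no explicit row-by-row loop is needed
org_names <- c("WAIROA ABANDONNED COUNCIL", "SMART ENVIORNMENTAL LTD")
cleaned <- stri_replace_all_fixed(org_names, patterns, replacements,
                                  vectorize_all = FALSE)
print(cleaned)
```

If separate files are still preferred, writing them with writeLines() rather than write.csv() would keep one bare word per line.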
