Replacing similar, misspelled words

Time: 2018-06-04 19:52:17

Tags: r

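I am trying to clean up my survey data. My data frame contains several values that should be identical; however, variations in spelling, spacing, and capitalization result in more levels than expected.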

str(data.frame$race)
"American Indian and Alaska Native" 
"Asian"                             
"Black of African American"        
"Black or African American"         
"Other"                            
"Unknown"                          
"white or Caucasian"                
"White or Caucasian"                
"White or Caucasion" 

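How can I find and replace these values to create a uniform spelling, and then convert the column back to a factor with the proper number of levels?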

1 Answer:

Answer 0: (score: 0)

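It is hard to find a one-size-fits-all solution, because seemingly similar strings can describe very different things (for example, Granada vs. Grenada). The comments on the original post are worth looking into.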

See "Approximate string matching" on Wikipedia (sometimes also called "fuzzy matching"). There are many ways to define "similar" for strings.

The most basic tool is the R function adist. It computes the so-called edit distance.

x <- c("American Indian and Alaska Native" ,
   "Asian"                             ,
   "Black of African American"        ,
   "Black or African American"         ,
   "Other"                            ,
   "Unknown"                          ,
   "white or Caucasian"                ,
   "White or Caucasian"                ,
   "White or Caucasion" )
u <- unique(x)
# compare all strings against each other
d <- adist(u)
# Do not list combinations of similar words twice
d[lower.tri(d)] <- NA
# Say your threshold below which you want to consider strings similar is 
# 2 edits:
a <- which(d > 0 & d < 2, arr.ind = TRUE)
a
##      row col
## [1,]   3   4
## [2,]   7   8
## [3,]   8   9
pairs <- cbind(u[a[,1]], u[a[,2]])
pairs
##      [,1]                        [,2]                       
## [1,] "Black of African American" "Black or African American"
## [2,] "white or Caucasian"        "White or Caucasian"       
## [3,] "White or Caucasian"        "White or Caucasion" 
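
When there are many levels, reading pairs one by one gets tedious. One possible way to get whole candidate groups is to cluster the strings on their edit distances. This is only a sketch and not part of the approach above: the use of hclust with complete linkage and the cut height of 2 are my own assumptions and would need tuning for real data.

```r
# Same unique strings as in the example above
u <- c("American Indian and Alaska Native", "Asian",
       "Black of African American", "Black or African American",
       "Other", "Unknown",
       "white or Caucasian", "White or Caucasian", "White or Caucasion")

# Pairwise edit distances, labelled with the strings themselves
d <- adist(u)
dimnames(d) <- list(u, u)

# Complete linkage: a group only forms if all of its members are
# within the cut height of each other
hc <- hclust(as.dist(d), method = "complete")
groups <- cutree(hc, h = 2)  # h = 2 is an assumed threshold

# Candidate groups to review by hand
split(u, groups)
```

With these nine strings, the three "White or Caucasian" variants end up in one group and the two "Black or African American" variants in another. The groups are still only suggestions to review.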

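In the end, though, you have to curate the results yourself to avoid accidentally merging categories that are genuinely different.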
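
Differences in case and spacing are usually safe to normalise automatically before any manual curation. A minimal sketch follows; the values in raw are made up for illustration and are not taken from the question's data.

```r
# Hypothetical raw values showing case and spacing variation
raw <- c("white or Caucasian", "White or Caucasian ", "White  or  Caucasian",
         "White or Caucasion")

# Trim the ends and collapse runs of whitespace to a single space
clean <- gsub("\\s+", " ", trimws(raw))

# ignore.case = TRUE makes adist() treat "w" and "W" as equal
d <- adist(clean, ignore.case = TRUE)
d[1, 2]  # 0: these two differed only in case and a trailing space
d[1, 4]  # 1: the "Caucasion" typo remains and still needs a manual fix
```

Only genuine misspellings are left after this step, which keeps the list of pairs to review shorter.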

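You can make the corrections reproducible by using a named vector as a translation dictionary. For example, looking at the pairs found above, I can create the following dictionary: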

dict <- c(
   # incorrect spellings          correct spellings
   # -------------------------    ----------------------------
   "Black of African American" =  "Black or African American",
   "white or Caucasian"        =  "White or Caucasian"       ,
   "White or Caucasion"        =  "White or Caucasian" 
)
# The correct levels need to be included, too.
# Lookup by name returns the first match, so the corrections above
# take precedence over the identity entries added here.
dict <- c(dict, setNames(u,u))

Then convert your factor column to character with as.character and apply the dictionary, as I do here with the original character vector x:

xcorrected <- dict[x]
# show without names, but the result is also correct if you just use
# xcorrected alone (remove as.character here to see the difference).
as.character(xcorrected)
## [1] "American Indian and Alaska Native" "Asian"                            
## [3] "Black or African American"         "Black or African American"        
## [5] "Other"                             "Unknown"                          
## [7] "White or Caucasian"                "White or Caucasian"               
## [9] "White or Caucasian"
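
To get back to a factor with the expected number of levels, as the question asks, something along these lines should work. The data frame df and its contents are hypothetical stand-ins for the survey data, and the dictionary is a shortened version of the one above.

```r
# Hypothetical data frame standing in for the survey data
df <- data.frame(
  race = factor(c("Black of African American", "Black or African American",
                  "white or Caucasian", "White or Caucasion", "Asian"))
)

# Dictionary: misspellings first, then every correct level mapped to itself
correct <- c("Asian", "Black or African American", "White or Caucasian")
dict <- c(
  "Black of African American" = "Black or African American",
  "white or Caucasian"        = "White or Caucasian",
  "White or Caucasion"        = "White or Caucasian",
  setNames(correct, correct)
)

# Look up by character value; indexing with the factor itself would
# use its integer codes instead of the labels
fixed <- unname(dict[as.character(df$race)])
stopifnot(!anyNA(fixed))  # an NA here means a value is missing from dict

df$race <- factor(fixed)
nlevels(df$race)
levels(df$race)
```

The anyNA check is there because a value that is missing from the dictionary silently becomes NA otherwise.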