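How to replace misspelled words in a table using R

Time: 2017-07-24 12:45:18

Tags: r

I have a table in which one column contains some misspelled strings. For example:

table$Status returns the following values: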

"alive" "sic" "alive" "sick" "alive" "si" "alive" "ali"   "alv"  
"dead" "alive" "alive" "alive" "al"    "dead"  "dead"  "de"    "dead" 
"dead"  "dea"   "dead"  "al"   "dead"  "de"    "al"  "de"    "sick" 
"dead"  "alive"

I would like every value corrected to alive, sick, or dead, as in the following example:

"alive" "sick" "alive" "sick" "alive" "sick" "alive" "alive"   "alive"  
"dead" "alive" "alive" "alive" "alive"    "dead"  "dead"  "dead"    "dead" 
"dead"  "dead"   "dead"  "alive"   "dead"  "dead"    "alive"  "dead"    "sick" 
"dead"  "alive"

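I know that the RecordLinkage package has a function that measures how similar two strings are, based on their Levenshtein distance, for example: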

levenshteinSim("al", "alive")

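So I could compare each individual value against the others and keep the best similarity. I also know that table(table$Status) gives me the count of each value, and the most frequent values should be the correct ones.

My question is how to compare the values with each other and replace them in my table. If anyone knows a simple way to do this, that would be very helpful.

1 answer:

Answer 0: (score: 1)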

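library(data.table)  # provides data.table() and the %like% operator

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))

# Classify each value by its prefix; anything that starts with
# neither "al" nor "si" falls through to "dead".
table[, Status2 := ifelse(Status %like% "^al", "alive",
                   ifelse(Status %like% "^si", "sick", "dead"))]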

Update

A more general solution:

library(data.table)

table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))

# Known correct spellings, entered by hand
correct_values <- c("alive", "sick", "dead")

for (i in seq_len(nrow(table))) {
  string <- table[i, Status]
  best <- 0
  to_replace <- string  # keep the original value if no candidate shares a character
  for (j in correct_values) {
    # Number of distinct characters shared by the value and the candidate
    similarity <- length(Reduce(intersect, strsplit(c(string, j), split = "")))
    if (similarity > best) {
      best <- similarity
      to_replace <- j
    }
  }
  set(table, i, "Status", to_replace)  # overwrite the value in place
}

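Here I assume you know which values are the correct ones (which is why correct_values is entered by hand). Each value in the Status column is replaced with the entry of correct_values that shares the most characters with it.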
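If you would rather score candidates by edit distance, as the question suggests, the following is a minimal base R sketch of that idea. It is not the RecordLinkage implementation: it assumes a similarity of 1 - distance / (length of the longer string), computed with adist() from the utils package, and the variable names status, similarity and fixed are only illustrative. correct_values is still entered by hand, because in the sample data "al" and "de" each occur more often than "sick", so the most frequent values are not a reliable dictionary.

```r
# Sample values taken from the question
status <- c("alive", "sic", "alive", "sick", "alive", "si", "alive", "ali", "alv",
            "dead", "alive", "alive", "alive", "al", "dead", "dead", "de", "dead",
            "dead", "dea", "dead", "al", "dead", "de", "al", "de", "sick",
            "dead", "alive")

# Assumption: the correct spellings are known in advance
correct_values <- c("alive", "sick", "dead")

# Levenshtein distances: one row per value, one column per candidate
distances <- adist(status, correct_values)

# Normalize by the longer of the two strings so that 1 means identical
longest <- outer(nchar(status), nchar(correct_values), pmax)
similarity <- 1 - distances / longest

# For each value, keep the candidate with the highest similarity
fixed <- correct_values[apply(similarity, 1, which.max)]
```

The normalization matters for short fragments: "al" is three edits away from both "alive" and "dead", but dividing by the length of the longer string favors "alive". The result can be written back with table[, Status := fixed] when status is taken from table$Status.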

I hope it helps!