I have a table in which one column contains some misspelled strings. For example:
table$Status
returns the following values:
"alive" "sic" "alive" "sick" "alive" "si" "alive" "ali" "alv"
"dead" "alive" "alive" "alive" "al" "dead" "dead" "de" "dead"
"dead" "dea" "dead" "al" "dead" "de" "al" "de" "sick"
"dead" "alive"
I want every value to be alive, sick, or dead, as in this example:
"alive" "sick" "alive" "sick" "alive" "sick" "alive" "alive" "alive"
"dead" "alive" "alive" "alive" "alive" "dead" "dead" "dead" "dead"
"dead" "dead" "dead" "alive" "dead" "dead" "alive" "dead" "sick"
"dead" "alive"
I know the RecordLinkage package has a function that measures the similarity between two strings, for example:
levenshteinSim("al", "alive")
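So I could compare each individual value against the others and keep the best similarity. I also know that table(table$Status) gives me the counts of each value, and the most frequent ones would be the correct spellings.
My question is: how do I compare the values against each other and replace them in my table? If anyone knows a simple way to do this, it would be very helpful.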
Answer 0 (score: 1)
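library(data.table)
# %like% is provided by data.table, so dplyr is not needed here
table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))
# Classify by prefix: "al..." -> alive, "si..." -> sick, anything else -> dead
table[, Status2 := ifelse(Status %like% "^al", "alive",
                   ifelse(Status %like% "^si", "sick", "dead"))]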
Update
A more general solution:
library(data.table)
table <- data.table(Status = c("alive", "sic", "alive", "sick", "alive", "si", "de", "al"))
correct_values <- c("alive", "sick", "dead")
for (i in seq_len(nrow(table))) {
  string <- table[i, Status]
  best <- 0              # highest similarity seen so far (avoids shadowing max())
  to_replace <- string   # keep the original value if nothing matches
  for (j in correct_values) {
    # similarity = number of distinct characters the two strings share
    similarity <- length(Reduce(intersect, strsplit(c(string, j), split = "")))
    if (similarity > best) {
      best <- similarity
      to_replace <- j
    }
  }
  set(table, i = i, j = "Status", value = to_replace)
}
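Here I assume you know which values are the correct ones (which is why correct_values is entered by hand). Each value in the Status column is replaced by the entry of correct_values that has the most characters in common with it.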
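Counting shared characters ignores their order, so an edit-distance score may be a safer choice for longer or more varied words. Below is a minimal sketch of that idea using base R's adist() (Levenshtein distance) instead of the loop above. The normalisation by the longer string's length is my own choice, meant to be similar in spirit to a similarity score such as levenshteinSim, and the sample vector `status` is made up from values in the question. It still assumes the correct spellings are known in advance.

```r
# Misspelled input and the canonical spellings (assumed to be known in advance).
status <- c("alive", "sic", "alive", "sick", "si", "ali", "alv",
            "dead", "al", "de", "dea")
correct_values <- c("alive", "sick", "dead")

# Levenshtein distance matrix: one row per element of `status`,
# one column per element of `correct_values`.
d <- utils::adist(status, correct_values)

# Normalise by the length of the longer string in each pair, so that a short
# prefix such as "al" is scored relative to the candidate's length.
longest <- outer(nchar(status), nchar(correct_values), pmax)
similarity <- 1 - d / longest

# For each input value, pick the candidate with the highest similarity.
cleaned <- correct_values[apply(similarity, 1, which.max)]
print(cleaned)
```

With a data.table like the one above, the same vector could be assigned back with something like table[, Status := cleaned] after computing it from table$Status. Without the normalisation, "al" is equally far (3 edits) from "alive" and "dead", which is why the raw distance alone is not used here.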
I hope this helps!