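I'm trying to figure out how to replace a long list of misspelled words using a list of the correct words, but I'm not sure how to go about it. Please let me know if this is possible. Thanks.
I tried str_replace and gsub, but because I want to apply the changes from a data frame, they don't seem to work that way.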
df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers"))
correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain American", "Avengers"))
I'd like the output to look like this:
df = tibble(Movie_Name = list("Black Panther", "Iron Man", "Captain America", "Black Panther", "Iron Man", "Captain America", "Avengers"))
Answer 0 (score: 1)
One approach might be to use the Levenshtein distance, which is available from the stringdist package.
library(stringdist)
MovieNames = unlist(df$Movie_Name)
CorrectNames = unlist(correct$correct_movie_name)
for(MN in MovieNames) {
  # Index of the correct name with the smallest Levenshtein distance to MN
  CMN = which.min(stringdist(CorrectNames, MN, method = "lv"))
  cat(MN, " should be ", CorrectNames[CMN], "\n")
}
Black Panthet should be Black Panther
Irom Man should be Iron Man
Captain Anerica should be Captain American
Black Panthers should be Black Panther
Iron Men should be Iron Man
Captain America should be Captain American
Avangers should be Avengers
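The loop above only prints the suggestions. To write them back into the data, one option is a vectorised lookup. The following is a minimal sketch that swaps in base R's `adist` (generalised Levenshtein distance) for `stringdist`, so it is not the answer's own code. The names `fix_names`, `movies` and `correct_names` are hypothetical, and the reference list here assumes "Captain America" rather than the question's "Captain American".

```r
# Sketch: replace each name with its nearest entry in a reference list.
# Uses base R's adist(); fix_names is a hypothetical helper name.
fix_names <- function(x, reference) {
  d <- adist(x, reference)             # rows: x, columns: reference
  reference[apply(d, 1, which.min)]    # nearest reference entry per row
}

movies <- c("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers",
            "Iron Men", "Captain America", "Avangers")
# Assumed reference list (uses "Captain America", unlike the question's data)
correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")

df_fixed <- data.frame(Movie_Name = fix_names(movies, correct_names),
                       stringsAsFactors = FALSE)
print(df_fixed)
```

Working on plain character vectors avoids the list columns used in the question; if the data is stored as a list column, `unlist()` it first, as the loop above does.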
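Answer 1 (score: 0)
I don't think there is a perfect solution. The best you can do is compute some kind of edit distance between Movie_Name and correct_movie_name, and replace each entry with the word in correct_movie_name that has the smallest distance. Which metric to use depends heavily on the situation and requires a lot of tuning. Here I use the stringdist function from the stringdist package, which offers a variety of distance metrics. The default is the "restricted Damerau-Levenshtein distance" (from ?stringdist). We can also use levenshteinDist from the RecordLinkage package: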
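library(dplyr)
library(purrr)        # provides map_chr()
library(stringdist)
library(RecordLinkage)
# For each name in vec, return the entry of replace_list with the smallest distance
replace_names <- function(vec, replace_list, dist_func){
  replace_list <- unlist(replace_list)  # accept a list column as well as a character vector
  map_chr(vec, ~{
    replace_list[which.min(dist_func(.x, replace_list))]
  })
}
df %>%
  mutate(Correct_stringdist = replace_names(Movie_Name, correct$correct_movie_name, stringdist),
         Correct_levenshsteinDist = replace_names(Movie_Name, correct$correct_movie_name, levenshteinDist))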
Output:
# A tibble: 7 x 3
  Movie_Name      Correct_stringdist Correct_levenshsteinDist
  <chr>           <chr>              <chr>
1 Black Panthet   Black Panther      Black Panther
2 Irom Man        Iron Man           Iron Man
3 Captain Anerica Captain American   Captain American
4 Black Panthers  Black Panther      Black Panther
5 Iron Men        Iron Man           Iron Man
6 Captain America Captain American   Captain American
7 Avangers        Avengers           Avengers
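Because the nearest candidate is always returned, a name that has no counterpart in the reference list would still be rewritten. One possible piece of the tuning mentioned above is a distance cutoff. This is a minimal sketch, assuming base R's `adist` as the metric and a cutoff of `max_dist = 2` chosen purely for illustration; `replace_if_close` is a hypothetical helper, and "Thor" is made-up test input that is not in the question's data.

```r
# Sketch: only replace when the nearest reference entry is close enough.
# replace_if_close and max_dist are hypothetical names for illustration.
replace_if_close <- function(x, reference, max_dist = 2) {
  d <- adist(x, reference)                 # Levenshtein distance matrix
  nearest <- apply(d, 1, which.min)        # index of the closest reference entry
  best <- d[cbind(seq_along(x), nearest)]  # distance to that entry
  ifelse(best <= max_dist, reference[nearest], x)  # otherwise keep the input
}

correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")
result <- replace_if_close(c("Irom Man", "Avangers", "Thor"), correct_names)
print(result)
```

A suitable cutoff depends on how long the names are and how noisy the input is, so it would need to be checked against real data.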
Answer 2 (score: 0)
The agrep function can perform approximate matching between strings.
df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerican", "Black Panthers", "Iron Men", "Captain America", "Avangers"))
correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers"))
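df2 = tibble(Movie_Name = sapply(df$Movie_Name, function(x){
  # Return the first correct name that approximately matches x
  for(i in correct$correct_movie_name){
    comparison <- agrep(i, x)
    if(length(comparison) != 0){
      if(comparison == 1){
        return(i)
      }
    }
  }
  # No approximate match found: keep the original value
  return(x)
}))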
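How loose the match is depends on `agrep`'s `max.distance` argument, which defaults to 0.1 (a fraction of the pattern length). The following small sketch illustrates its effect with `value = TRUE`; the `candidates` vector and the chosen limits are assumptions for illustration only.

```r
# Sketch: agrep() returns the elements of x that approximately contain the pattern.
candidates <- c("Irom Man", "Iron Men", "Avangers")

# Allow up to 2 edits in total (an illustrative choice, not a recommendation).
hits <- agrep("Iron Man", candidates, max.distance = 2, value = TRUE)
print(hits)

# With max.distance = 0 only exact occurrences of the pattern match.
exact <- agrep("Iron Man", candidates, max.distance = 0, value = TRUE)
print(exact)
```

Note that `agrep` looks for the pattern anywhere inside each string, so a longer string that contains a near copy of the pattern also counts as a match.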
Answer 3 (score: 0)
Here is a solution based on the answers by @G5W and avid_useR.
library(tidyverse)
library(stringdist)
Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers")
correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers")
New_Movie_name <- lapply(Movie_Name, function(x) {
  # Distance from x to every correct name, then pick the closest one
  lapply(correct_movie_name, function(y) {
    stringdist(x, y)
  }) %>% unlist() %>% which.min() %>% correct_movie_name[[.]]
})
# New_Movie_name is a list of the same length as Movie_Name but with correct movie names based on elements in list correct_movie_name