How do I replace a list of misspelled words with a list of correct words?

Time: 2019-01-30 21:04:17

Tags: r text-mining

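I'm trying to figure out how to replace a long list of misspelled words using a list of correct words, but I'm not sure how to do it. Please advise if this is possible. Thanks.

I tried str_replace and gsub, but because I want to apply the change inside a data frame, they don't work that way.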

df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers"))

correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain American", "Avengers"))

I want the output to look like this:

df = tibble(Movie_Name = list("Black Panther", "Iron Man", "Captain America", "Black Panther", "Iron Man", "Captain America", "Avengers"))

4 answers:

Answer 0 (score: 1)

One approach is to use the Levenshtein distance, which is available in the stringdist package.

library(stringdist)

MovieNames   = unlist(df$Movie_Name)
CorrectNames = unlist(correct$correct_movie_name)

for(MN in MovieNames) {
    CMN = which.min(stringdist(CorrectNames,  MN, method = "lv"))
    cat(MN, " should be ",  CorrectNames[CMN], "\n")
}

Black Panthet  should be  Black Panther 
Irom Man  should be  Iron Man 
Captain Anerica  should be  Captain American 
Black Panthers  should be  Black Panther 
Iron Men  should be  Iron Man 
Captain America  should be  Captain American 
Avangers  should be  Avengers 
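
The loop above only prints suggestions. As a minimal sketch of writing the corrections back, the following uses amatch from the same stringdist package, which returns the index of the closest entry within maxDist, or NA if nothing is close enough. It assumes plain character vectors rather than list columns, and it assumes the reference list contains "Captain America" (the question's list has "Captain American"). The maxDist = 2 threshold and the extra "Thor" entry are illustrative assumptions, not part of the original question.

```r
library(stringdist)

# Illustrative data: the question's names plus one entry with no close match
movie_names   <- c("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers",
                   "Iron Men", "Captain America", "Avangers", "Thor")
correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")

# Index of the closest correct name within 2 edits, NA when none qualifies
idx <- amatch(movie_names, correct_names, method = "lv", maxDist = 2)

# Keep the original value wherever no acceptable match was found
fixed <- ifelse(is.na(idx), movie_names, correct_names[idx])
print(fixed)
```

Setting a maximum distance avoids forcing every value onto some reference name, which which.min would do even for entries that are not in the reference list at all.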

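Answer 1 (score: 0)

I don't think there is a perfect solution to this. Your best bet is to calculate some kind of edit distance between Movie_Name and correct_movie_name, and replace each value with the entry from correct_movie_name that has the smallest distance. Which metric to use is highly situational and takes a lot of tweaking. Here I've used the stringdist function from the stringdist package, which has a variety of distance measures to choose from. The default is the "restricted Damerau-Levenshtein distance" (from ?stringdist). We can also use levenshteinDist from the RecordLinkage package: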

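library(dplyr)
library(purrr)          # provides map_chr
library(stringdist)
library(RecordLinkage)

# For each name in vec, return the closest entry in replace_list under dist_func
replace_names <- function(vec, replace_list, dist_func){
  replace_list <- unlist(replace_list)  # accept list columns as well as character vectors
  map_chr(vec, ~{
    replace_list[which.min(dist_func(.x, replace_list))]
  })
}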

df %>%
  mutate(Correct_stringdist = replace_names(Movie_Name, correct$correct_movie_name, stringdist),
         Correct_levenshsteinDist = replace_names(Movie_Name, correct$correct_movie_name, levenshteinDist))

Output:

# A tibble: 7 x 3
  Movie_Name      Correct_stringdist Correct_levenshsteinDist
  <chr>           <chr>              <chr>                   
1 Black Panthet   Black Panther      Black Panther           
2 Irom Man        Iron Man           Iron Man                
3 Captain Anerica Captain American   Captain American        
4 Black Panthers  Black Panther      Black Panther           
5 Iron Men        Iron Man           Iron Man                
6 Captain America Captain American   Captain American        
7 Avangers        Avengers           Avengers 
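
Since the choice of metric takes some tweaking, it can help to look at the distances themselves and not only at the winning name. The following is a rough sketch using base R's adist (generalized Levenshtein distance), so it needs no extra packages. It assumes plain character vectors and a reference list containing "Captain America"; the variable names are made up for illustration.

```r
movie_names   <- c("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers",
                   "Iron Men", "Captain America", "Avangers")
correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")

# Distance matrix: one row per observed name, one column per candidate
d <- adist(movie_names, correct_names)
dimnames(d) <- list(movie_names, correct_names)

# Closest candidate and its distance for every observed name
best     <- unname(apply(d, 1, which.min))
closest  <- correct_names[best]
distance <- unname(apply(d, 1, min))

print(data.frame(Movie_Name = movie_names, Closest = closest, Distance = distance))
```

Rows with an unusually large Distance are the ones worth checking by hand before replacing anything.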

Answer 2 (score: 0)

The agrep function performs approximate matching between strings.

df = tibble(Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerican", "Black Panthers", "Iron Men", "Captain America", "Avangers"))

correct = tibble(correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers"))

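df2 = tibble(Movie_Name = sapply(df$Movie_Name, function(x){
  # return the first correct name that approximately matches x
  for(i in correct$correct_movie_name){
    comparison <- agrep(i, x)
    if(length(comparison) != 0 && comparison == 1){
      return(i)
    }
  }
  # no approximate match found: keep the original name
  return(x)
}))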
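
Because agrep accepts a whole vector as its second argument, the same idea can be sketched with a single loop over the correct names. This is only an illustration on plain character vectors, using this answer's data and agrep's default max.distance; the names fixed and hits are made up for the example.

```r
movie_names   <- c("Black Panthet", "Irom Man", "Captain Anerican", "Black Panthers",
                   "Iron Men", "Captain America", "Avangers")
correct_names <- c("Black Panther", "Iron Man", "Captain America", "Avengers")

fixed <- movie_names
for (name in correct_names) {
  # positions in movie_names that approximately contain this correct name
  hits <- agrep(name, movie_names)
  fixed[hits] <- name
}
print(fixed)
```

Note that if an entry approximately matches more than one correct name, the last one in correct_names wins here, whereas the sapply version above returns the first.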

Answer 3 (score: 0)

Here is a solution based on the answers by @G5W and avid_useR:

library(tidyverse)
library(stringdist)

Movie_Name = list("Black Panthet", "Irom Man", "Captain Anerica", "Black Panthers", "Iron Men", "Captain America", "Avangers")

correct_movie_name = list("Black Panther", "Iron Man", "Captain America", "Avengers")

New_Movie_name <- lapply(Movie_Name, function(x) {
  lapply(correct_movie_name, function(y) {
    stringdist(x,y)
  }) %>% unlist() %>% which.min() %>% correct_movie_name[[.]]
})

# New_Movie_name is a list of the same length as Movie_Name but with correct movie names based on elements in list correct_movie_name