How to use lapply with str_replace_all and hunspell_suggest to replace all misspelled words?

Time: 2019-05-07 23:55:06

Tags: r stringr hunspell

I am trying to figure out how to combine str_replace_all and hunspell_suggest. Here is where I am so far.

I have a data frame that looks like this:

library(hunspell)
library(stringr)  # needed for str_replace_all() used below
df1 <- data.frame("Index" = 1:7, "Text" = c("Brad came to dinner with us tonigh.",
                                            "Wuld you like to trave with me?",
                                            "There is so muh to undestand.",
                                            "Sentences cone in many shaes and sizes.",
                                            "Learnin R is fun",
                                            "yesterday was Friday",
                                            "bing search engine"))

Here is my code to identify the misspelled words in the column:

df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)

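However, I am stuck on how to replace each misspelled word with the first suggestion from hunspell_suggest.

I tried the following code, but it only works for one row, and only when that row contains a single misspelled word: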

df1$replace <- str_replace_all(df1$Text, df1$word_check[[1]], hunspell_suggest(df1$word_check[[1]])[[1]][1])

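I am not sure how to incorporate lapply into the code above so that it efficiently replaces all misspelled words with the first suggestion from hunspell_suggest, while leaving correctly spelled words untouched.

Thank you.

2 Answers: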

Answer 0: (score: 1)

Here is one solution using the DataCombine package:

library(DataCombine)

# vector of words to replace
wrong <- unlist(hunspell(df1$Text))
# vector of the first suggested words
correct <- sapply(wrong, function(x) hunspell_suggest(x)[[1]][1])

# keep both columns as character rather than factor (matters on R < 4.0)
Replaces <- data.frame(from = wrong, to = correct, stringsAsFactors = FALSE)

FindReplace(data = df1, Var = "Text", replaceData = Replaces,
                       from = "from", to = "to", exact = FALSE)

#Index                                   Text
#1     1   Brad came to dinner with us tonight.
#2     2        Wald you like to trace with me?
#3     3         There is so hum to understand.
#4     4 Sentences cone in many shes and sizes.
#5     5                      Learning R is fun
#6     6                   yesterday was Friday
#7     7                     bung search engine
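
The same lookup idea can also be expressed with the str_replace_all() function from the question, since it accepts a named vector of pattern = replacement pairs. The following is only a minimal sketch: the `lookup` vector is hard-coded from the output shown above as an assumption (in practice it would be built from hunspell() and hunspell_suggest()), and the names `text`, `lookup` and `fixed_text` are made up for illustration.

```r
library(stringr)

text <- c("Brad came to dinner with us tonigh.",
          "There is so muh to undestand.",
          "yesterday was Friday")

# Misspelled word -> replacement, copied by hand from the output above.
lookup <- c(tonigh = "tonight", muh = "hum", undestand = "understand")

# Wrap each name in \\b so that only whole words are matched.
names(lookup) <- paste0("\\b", names(lookup), "\\b")

# A named vector applies every pattern = replacement pair to every string.
fixed_text <- str_replace_all(text, lookup)
fixed_text
```

Because the names are treated as regular expressions, the word boundaries keep a short misspelling from being replaced inside a longer, correctly spelled word.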

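Answer 1: (score: 1)

Although this has already been solved, let me leave another option for you. You tried to use str_replace_all(); I used stri_replace_all_fixed() instead. The first step is to identify the bad words and store them in badwords. The second step is to call sapply() on the result of hunspell_suggest() to extract the first suggestion for each word, and store the result in suggestions. Finally, I pass these two vectors to stri_replace_all_fixed().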

library(dplyr)
library(stringi)
library(hunspell)

df1 <- data.frame("Index" = 1:7, "Text" = c("Brad came to dinner with us tonigh.",
                                            "Wuld you like to trave with me?",
                                            "There is so muh to undestand.",
                                            "Sentences cone in many shaes and sizes.",
                                            "Learnin R is fun",
                                            "yesterday was Friday",
                                            "bing search engine"),
                  stringsAsFactors = FALSE)

# Get bad words.
badwords <- hunspell(df1$Text) %>% unlist

# Extract the first suggestion for each bad word.
suggestions <- sapply(hunspell_suggest(badwords), "[[", 1)

mutate(df1, Text = stri_replace_all_fixed(str = Text,
                                          pattern = badwords,
                                          replacement = suggestions,
                                          vectorize_all = FALSE)) -> out

#  Index                                   Text
#1     1   Brad came to dinner with us tonight.
#2     2        Wald you like to trace with me?
#3     3         There is so hum to understand.
#4     4 Sentences cone in many shes and sizes.
#5     5                      Learning R is fun
#6     6                   yesterday was Friday
#7     7                     bung search engine
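
Both answers build one global replacement list and apply it to every row, so a word flagged in one row is, as far as I can tell, also replaced wherever the same letters occur in other rows. A row-wise variant, closer to the lapply approach the question asks about, might look like the sketch below. The helpers `replace_in_row` and `first_suggestion` and the `Fixed` column are hypothetical names, not part of hunspell, and the requireNamespace() guard is only there so that the sketch loads even when the package is missing.

```r
df1 <- data.frame(Index = 1:3,
                  Text = c("There is so muh to undestand.",
                           "Learnin R is fun",
                           "yesterday was Friday"),
                  stringsAsFactors = FALSE)

# Replace each flagged word in ONE string with its chosen correction.
# `bad` and `fixes` are parallel character vectors for that string.
replace_in_row <- function(text, bad, fixes) {
  for (i in seq_along(bad)) {
    if (is.na(fixes[i])) next                      # no suggestion: keep the word
    # \\Q...\\E quotes the word literally; \\b restricts to whole words.
    pattern <- paste0("\\b\\Q", bad[i], "\\E\\b")
    text <- gsub(pattern, fixes[i], text, perl = TRUE)
  }
  text
}

# First suggestion per word, or NA when hunspell offers none.
first_suggestion <- function(words) {
  vapply(hunspell::hunspell_suggest(words),
         function(s) if (length(s) > 0) s[1] else NA_character_,
         character(1))
}

if (requireNamespace("hunspell", quietly = TRUE)) {
  bad_by_row <- hunspell::hunspell(df1$Text)         # one character vector per row
  fix_by_row <- lapply(bad_by_row, first_suggestion) # matching suggestions per row
  df1$Fixed  <- mapply(replace_in_row, df1$Text, bad_by_row, fix_by_row,
                       USE.NAMES = FALSE)
}
```

Rows without misspellings pass through unchanged because their list of flagged words is empty. The first suggestion is still not guaranteed to be the intended word, as the outputs above show.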