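I am working with a large data frame in which each row contains a lot of text, and I want to use the hunspell package to efficiently identify and replace the misspelled words in each sentence. I can identify the misspelled words, but I can't figure out how to use hunspell_suggest on a list.
Here is a sample of the data frame: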
df1 <- data.frame("Index" = 1:7, "Text" = c("A complec sentence joins an independet",
"Mary and Samantha arived at the bus staton before noon",
"I did not see thm at the station in the mrning",
"The participnts read 60 sentences in radom order",
"how to fix mispelled words in R languge",
"today is Tuesday",
"bing sports quiz"))
I converted the Text column to character and used hunspell to identify the misspelled words in each row.
library(hunspell)
df1$Text <- as.character(df1$Text)
df1$word_check <- hunspell(df1$Text)
I tried
df1$suggest <- hunspell_suggest(df1$word_check)
but it keeps giving this error:
Error in hunspell_suggest(df1$word_check) :
is.character(words) is not TRUE
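I'm new to this, so I'm not sure what the suggestion column produced by the hunspell_suggest function should look like. Any help would be greatly appreciated.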
Answer 0 (score: 1)
Check your intermediate steps. The output of df1$word_check looks like this:
List of 5
$ : chr [1:2] "complec" "independet"
$ : chr [1:2] "arived" "staton"
$ : chr [1:2] "thm" "mrning"
$ : chr [1:2] "participnts" "radom"
$ : chr [1:2] "mispelled" "languge"
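That output is of type list, whereas hunspell_suggest expects a character vector (hence the is.character(words) error). If you run lapply(df1$word_check, hunspell_suggest), you get the suggestions.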
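To make the type mismatch concrete, here is a minimal sketch. The word_check list is typed in by hand from the output above rather than computed, and the hunspell call is wrapped in requireNamespace() so the snippet still runs where the package is not installed. The suggestions you get depend on the installed dictionary.

```r
# Hand-typed copy of the first three elements of df1$word_check shown above
word_check <- list(
  c("complec", "independet"),
  c("arived", "staton"),
  c("thm", "mrning")
)

# hunspell_suggest() requires a character vector; a list is not one
is.character(word_check)        # FALSE, hence the error
is.character(word_check[[1]])   # TRUE, each element is fine on its own

# So apply the function to each element of the list
if (requireNamespace("hunspell", quietly = TRUE)) {
  suggestions <- lapply(word_check, hunspell::hunspell_suggest)
  # suggestions[[1]] is itself a list: one character vector of
  # candidates per misspelled word in the first row
  str(suggestions[[1]])
}
```

Because the result has one element per row, it can be stored as a list column in the same way word_check was: df1$suggest <- lapply(df1$word_check, hunspell_suggest).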
Edit
Since I didn't find any simple option, I decided to look at this problem in more detail. Here is what I came up with:
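cleantext <- function(x) {
  # Correct each string in turn; seq_along() is safe for zero-length input
  for (y in seq_along(x)) {
    # Misspelled words found in this string
    bad <- hunspell(x[y])[[1]]
    if (length(bad)) {
      # First suggestion for each misspelled word
      good <- unlist(lapply(hunspell_suggest(bad), `[[`, 1))
      for (i in seq_along(bad)) {
        x[y] <- gsub(bad[i], good[i], x[y])
      }
    }
  }
  x
}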
There may be a more elegant approach, but this function returns a vector of corrected strings, as shown here:
> df1$Text
[1] "A complec sentence joins an independet"
[2] "Mary and Samantha arived at the bus staton before noon"
[3] "I did not see thm at the station in the mrning"
[4] "The participnts read 60 sentences in radom order"
[5] "how to fix mispelled words in R languge"
[6] "today is Tuesday"
[7] "bing sports quiz"
> cleantext(df1$Text)
[1] "A complex sentence joins an independent"
[2] "Mary and Samantha rived at the bus station before noon"
[3] "I did not see them at the station in the morning"
[4] "The participants read 60 sentences in radon order"
[5] "how to fix misspelled words in R language"
[6] "today is Tuesday"
[7] "bung sports quiz"
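Note that this uses the first suggestion given by hunspell, which may or may not be correct: in the output above, "arived" became "rived", "radom" became "radon", and "bing" became "bung".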
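Two other things to watch in cleantext: gsub() treats each misspelled word as a regular expression and replaces it wherever it occurs, including inside longer, correctly spelled words, and `[[`(..., 1) raises a "subscript out of bounds" error if a word has no suggestions at all. One possible tightening is sketched below. first_suggestion and replace_words are hypothetical helper names introduced here, not part of hunspell, and the suggestion list in the last call is hand-written for illustration.

```r
# Pick the first candidate for each word, keeping the original word
# when the suggestion vector for it is empty.
# `suggestions` has the shape returned by hunspell_suggest(): a list
# with one character vector per word.
first_suggestion <- function(words, suggestions) {
  vapply(seq_along(words), function(i) {
    if (length(suggestions[[i]]) > 0) suggestions[[i]][1] else words[i]
  }, character(1))
}

# Replace whole words only. \\Q...\\E quotes the word literally and
# \\b anchors it at word boundaries (both require perl = TRUE).
replace_words <- function(text, bad, good) {
  for (i in seq_along(bad)) {
    pattern <- paste0("\\b\\Q", bad[i], "\\E\\b")
    text <- gsub(pattern, good[i], text, perl = TRUE)
  }
  text
}

# Plain gsub() also rewrites the "thm" inside "rhythm"
gsub("thm", "them", "I did not see thm keep the rhythm")
# The whole-word version leaves "rhythm" alone
replace_words("I did not see thm keep the rhythm", "thm", "them")

# Hand-written suggestion list; the second word has no candidates
first_suggestion(c("thm", "xqzt"), list(c("them", "thy"), character(0)))
```

Inside cleantext, these would stand in for the `[[` extraction and the inner gsub() loop, for example good <- first_suggestion(bad, hunspell_suggest(bad)) followed by x[y] <- replace_words(x[y], bad, good).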