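I have a list of phrases, and I want to replace certain words with a similar word in case they are misspelled.
How can I search a string for a word, match it, and replace it?
The expected result is as follows: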
a1<- c(" the classroom is ful ")
a2<- c(" full")
In this case, I would replace ful with full in a1.
Answer 0 (score: 1)
I think the function you are looking for is gsub():
gsub (pattern = "ful", replacement = a2, x = a1)
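One caveat with a plain pattern is that gsub() replaces every occurrence of the substring, including occurrences inside longer words. The following is a minimal base R sketch of that difference; the input string is a made-up example, not taken from the question.

```r
# Hypothetical input: "ful" is a typo, but "thankful" is spelled correctly
x <- "the thankful classroom is ful"

# Plain pattern: also rewrites the "ful" inside "thankful"
plain <- gsub(pattern = "ful", replacement = "full", x = x)

# Word-boundary pattern: only the standalone word "ful" is replaced
bounded <- gsub(pattern = "\\bful\\b", replacement = "full", x = x)

print(plain)
print(bounded)
```

If the misspelled fragment can also occur inside valid words, the bounded form is probably the safer default.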
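Answer 1 (score: 1)
Take a look at the hunspell package. As the comments have already pointed out, your problem is much harder than it looks, unless you already have a dictionary of misspelled words and their correct spellings.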
library(hunspell)
a1 <- c(" the classroom is ful ")
bads <- hunspell(a1)
bads
# [[1]]
# [1] "ful"
hunspell_suggest(bads[[1]])
# [[1]]
# [1] "fool" "flu" "fl" "fuel" "furl" "foul" "full" "fun" "fur" "fut" "fol" "fug" "fum"
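So even in your example, do you want to replace ful with full, or with one of the many other options here?
The package allows you to use your own dictionary. Let's say you are doing that, or at least that you are happy with the first suggestion returned.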
library(stringr)
str_replace_all(a1, bads[[1]], hunspell_suggest(bads[[1]])[[1]][1])
# [1] " the classroom is fool "
However, as other comments and answers have pointed out, you need to be careful about the word appearing inside other words.
a3 <- c(" the thankful classroom is ful ")
str_replace_all(a3,
paste("\\b",
hunspell(a3)[[1]],
"\\b",
collapse = "", sep = ""),
hunspell_suggest(hunspell(a3)[[1]])[[1]][1])
# [1] " the thankful classroom is fool "
Based on your comment, you already have a dictionary, structured as one vector of bad words and another vector of their replacements.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
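Addressing your comment with your new example brings us back to the problem of words appearing inside other words. The solution is to use \\b, which represents a word boundary. The pattern "thin" will match "thin", "think", "thinking", and so on. But if you surround it with \\b, the pattern is anchored to word boundaries: \\bthin\\b will only match "thin".
Your example: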
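The word-boundary behaviour described above can be checked directly with base R's grepl(). This is a small sketch; the word "within" is an extra example added here for illustration.

```r
# Candidate words; "within" is a hypothetical extra that contains "thin"
words <- c("thin", "think", "thinking", "within")

# Unanchored: matches any word containing the substring "thin"
unanchored <- grepl("thin", words)

# Anchored with \\b on both sides: matches only the standalone word "thin"
anchored <- grepl("\\bthin\\b", words)

print(data.frame(words, unanchored, anchored))
```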
a <- c(" thin, thic, thi")
badwords.corpus <- c("thin", "thic", "thi" )
goodwords.corpus <- c("think", "thick", "this")
The solution is to modify badwords.corpus:
badwords.corpus <- paste("\\b", badwords.corpus, "\\b", sep = "")
badwords.corpus
# [1] "\\bthin\\b" "\\bthic\\b" "\\bthi\\b"
Then create vect.corpus as I described in the previous update, and use it in str_replace_all.
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a, vect.corpus)
# [1] " think, thick, this"
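If stringr is not available, roughly the same dictionary-based replacement can be sketched in base R. The helper name replace_words is hypothetical, and the sketch assumes the bad words contain no regex metacharacters.

```r
# Hypothetical helper (not part of stringr): apply a bad -> good dictionary
# with base R only, matching whole words via \\b.
replace_words <- function(text, bad, good) {
  stopifnot(length(bad) == length(good))
  for (i in seq_along(bad)) {
    pattern <- paste0("\\b", bad[i], "\\b")  # anchor to word boundaries
    text <- gsub(pattern, good[i], text)
  }
  text
}

a <- " thin, thic, thi"
result <- replace_words(a,
                        bad  = c("thin", "thic", "thi"),
                        good = c("think", "thick", "this"))
print(result)
```

Replacements are applied one after another, so a word produced by an earlier replacement could in principle be matched by a later pattern; with the boundaries in place that does not happen for this input.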
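Answer 2 (score: 0)
Create a list of corrections and then replace them using gsubfn, which is a generalization of gsub that can also take a list, function, or proto object as the replacement. The regular expression matches a word boundary, one or more word characters, and another word boundary. Each time it finds a match, it looks the match up in the list names and, if found, replaces it with the corresponding list value.
library(gsubfn)
L <- list(ful = "full") # can add more words to this list if desired
gsubfn("\\b\\w+\\b", L, a1, perl = TRUE)
## [1] " the classroom is full "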
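The lookup-per-match idea can also be approximated without gsubfn. The sketch below is a base R stand-in built on gregexpr() and regmatches(), not gsubfn's own implementation; the names corrections and fix_words are made up for illustration.

```r
a1 <- " the classroom is ful "

# Named character vector: names are misspellings, values are corrections
corrections <- c(ful = "full")

# Locate every whole word in the string
m <- gregexpr("\\b\\w+\\b", a1, perl = TRUE)

# For each matched word, substitute the correction if one is listed
fix_words <- function(w) {
  hit <- w %in% names(corrections)
  w[hit] <- corrections[w[hit]]
  w
}

result <- a1
regmatches(result, m) <- lapply(regmatches(result, m), fix_words)
print(result)
```

Because every word is looked up exactly once, a correction is never itself re-corrected, which sidesteps ordering issues between dictionary entries.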
Answer 3 (score: 0)
For an ordered replacement, you can try this:
a1 <- c("the classroome is ful")
# ordered replacement
badwords.corpus <- c("ful", "classroome")
goodwords.corpus <- c("full", "classroom")
qdap::mgsub(badwords.corpus, goodwords.corpus, a1) # or
stringi::stri_replace_all_fixed(a1, badwords.corpus, goodwords.corpus, vectorize_all = FALSE)
For an unordered replacement, you can use approximate string matching (see stringdist::amatch). Here is an example:
a1 <- c("the classroome is ful")
a1
# [1] "the classroome is ful"
library(stringdist)
goodwords.corpus <- c("full", "classroom")
badwords.corpus <- unlist(strsplit(a1, " ")) # extract words
for (badword in badwords.corpus) {
  patt <- paste0('\\b', badword, '\\b')
  repl <- goodwords.corpus[amatch(badword, goodwords.corpus, maxDist = 1)] # you can change the distance, see ?amatch
  final.word <- ifelse(is.na(repl), badword, repl)
  a1 <- gsub(patt, final.word, a1)
}
a1
# [1] "the classroom is full"
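A similar approximate match can be sketched with base R's adist(), which computes the generalized Levenshtein distance. This swaps in a different function from stringdist::amatch, so treat it as an illustration under that assumption; the threshold of 1 mirrors maxDist = 1 above.

```r
a1 <- "the classroome is ful"
goodwords <- c("full", "classroom")

# Split the phrase into words
words <- strsplit(a1, " ", fixed = TRUE)[[1]]

# Edit-distance matrix: one row per word, one column per good word
d <- adist(words, goodwords)

# Closest good word for each word, and whether it is within distance 1
best <- apply(d, 1, which.min)
is_close <- d[cbind(seq_along(words), best)] <= 1

# Replace only the words that have a close enough match
words[is_close] <- goodwords[best[is_close]]
result <- paste(words, collapse = " ")
print(result)
```

As with amatch, a looser threshold fixes more typos but raises the risk of rewriting a correct word into an unrelated one.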