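This code (seemingly) works, but I'm looking for ways to shorten it. It implements custom stemming, i.e. grouping word variants under a common stem.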
# Simple reproducible example:
library(tm)
vec <- c("partners, very good", "partnery SOso Goodish!", "partna goodies",
"Good night")
corp <- Corpus(VectorSource(vec))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
# Custom stemming (how can I shorten this code and avoid repetition?)
corp <- tm_map(
  corp,
  content_transformer(gsub),
  pattern = "good[^ ]*",
  replacement = "good"
)
corp <- tm_map(
  corp,
  content_transformer(gsub),
  pattern = "partn[^ ]*",
  replacement = "partn"
)
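Background: I can't use the standard stemming methods because:
Edit
I have arrived at a solution that is satisfactory and scales better, but I still feel this is not how it is supposed to be done...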
# Make a list of pattern/replacement pairs
steml <- list(
  c("good[^ ]*", "good"),
  c("partn[^ ]*", "partn")
)
# Apply each pair in turn; i indexes into steml
for (i in seq_along(steml)) {
  corp <- tm_map(
    corp,
    content_transformer(gsub),
    pattern = steml[[i]][1],
    replacement = steml[[i]][2]
  )
}
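One possible way to tighten the loop above is to move the iteration into a single function that applies every rule to a character vector, so that only one transformation has to be registered. The sketch below is only an illustration: `apply_stem_rules` and `stem_rules` are hypothetical names that do not come from tm, and the demo runs on a plain character vector standing in for the lowercased, punctuation-free documents.

```r
# Hypothetical helper (not part of tm): apply every rule to a character vector.
# `rules` is a named character vector: names are regex patterns,
# values are the replacements.
apply_stem_rules <- function(x, rules) {
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

# Same two rules as in the question, stored as pattern = replacement
stem_rules <- c(
  "good[^ ]*"  = "good",
  "partn[^ ]*" = "partn"
)

# Stand-in for the corpus text after tolower and removePunctuation
docs <- c(
  "partners very good",
  "partnery soso goodish",
  "partna goodies",
  "good night"
)

stemmed <- apply_stem_rules(docs, stem_rules)
print(stemmed)
```

Since the code in the question already passes extra arguments (`pattern`, `replacement`) through `tm_map`, the same helper could presumably be plugged in as `tm_map(corp, content_transformer(apply_stem_rules), rules = stem_rules)`. That call is an untested assumption here, but it would replace the loop with a single transformation.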