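I want to stem every word. For example, "hardworking employees" should become "hardwork employee", not "hardworking employee". Put simply, each of the two words should be stemmed separately. I know this example does not make much sense, but it is only an illustration; in my real data, stemming this way is meaningful.
I have a function that splits the text on the ',' delimiter and then stems each piece. I would like to modify it so that every word inside each ','-delimited piece is stemmed.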
dt <- read.table(header = TRUE,
                 text = "Word Synonyms
employee 'hardworking employees, intelligent employees, employment, employee'
lover 'loved ones, loving boy, lover'
", stringsAsFactors = FALSE)
library(SnowballC)
library(parallel)
stem_text3 <- function(text, language = "english", mc.cores = 3) {
  stem_string <- function(str, language) {
    # split on commas only, so each multi-word phrase is stemmed as one string
    str <- strsplit(x = str, split = "\\,")
    str <- wordStem(unlist(str), language = language)
    str <- paste(str, collapse = ",")
    return(str)
  }
  # stem each text block in turn
  # (mc.cores > 1 is not supported on Windows; use mc.cores = 1 there)
  x <- mclapply(X = text, FUN = stem_string, language, mc.cores = mc.cores)
  # return stemmed text blocks
  return(unlist(x))
}
df000 <- data.frame(stringsAsFactors = FALSE)
for (i in seq_len(nrow(dt))) {
  sent <- dt[i, "Synonyms"]
  k <- data.frame(r_synonyms = stem_text3(sent, language = "en"), stringsAsFactors = FALSE)
  df000 <- rbind(df000, k)
}
Answer 0 (score: 2)
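This is tricky because SnowballC::wordStem() stems each element of a character vector as a single word, so the strings have to be split into individual words and reassembled afterwards in order to use it.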
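To make the per-element behaviour concrete, here is a minimal sketch that compares stemming a whole phrase with stemming its words one by one. It assumes the default "porter" language of wordStem(); the variable name `words` is only illustrative.

```r
library(SnowballC)

# The whole phrase is a single element, so it is handed to the stemmer as one "word".
wordStem("hardworking employees")

# Splitting into words first lets every word be stemmed on its own,
# after which the phrase can be pasted back together.
words <- strsplit("hardworking employees", " ", fixed = TRUE)[[1]]
paste(wordStem(words), collapse = " ")
```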
I would drop the loop and vectorize this with an apply operation (you can swap it for mclapply() if you want).
library("stringi")
dt[["Synonyms"]] <-
  sapply(stri_split_fixed(dt[["Synonyms"]], ","), function(x) {
    x <- lapply(stri_split_fixed(stri_trim_both(x), " "), function(y) {
      paste(SnowballC::wordStem(y), collapse = " ")
    })
    paste(x, collapse = ", ")
  })
dt
## Word Synonyms
## 1 employee hardwork employe, intellig employe, employ, employe
## 2 lover love on, love boi, lover
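If you would rather avoid the stringi dependency, the same split, stem and reassemble steps can be sketched in base R. The helper name `stem_phrases` is hypothetical (it is not part of SnowballC or of the code above), and the sketch assumes phrases are separated by commas and words by whitespace.

```r
library(SnowballC)

# Hypothetical helper: stem every word in each comma-separated phrase,
# keeping the phrase boundaries intact.
stem_phrases <- function(x, language = "porter") {
  vapply(strsplit(x, ",", fixed = TRUE), function(phrases) {
    # split each trimmed phrase into words
    words <- strsplit(trimws(phrases), "[[:space:]]+")
    # stem the words of each phrase and rebuild the phrase
    stemmed <- vapply(words, function(w) {
      paste(wordStem(w, language = language), collapse = " ")
    }, character(1))
    # rebuild the comma-separated list
    paste(stemmed, collapse = ", ")
  }, character(1), USE.NAMES = FALSE)
}

stem_phrases(c("hardworking employees, intelligent employees, employment, employee",
               "loved ones, loving boy, lover"))
```

With the default "porter" language this should reproduce the Synonyms column printed above. Because the outer vapply() works on one input string at a time, it is the natural place to substitute a parallel apply if the data is large.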
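Notes:
First, I do not think these are the stems you were expecting, but this is how the Porter stemmer implemented in SnowballC works.
Second, there are better ways to tackle the problem as a whole, but I cannot really suggest one unless you explain what you are ultimately trying to achieve. For example, to replace a set of phrases in quanteda (using wildcards instead of stemming), you could do the following: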
library("quanteda")
thedict <- dictionary(list(
  employee = c("hardwork* employ*", "intellig* employ*", "employment", "employee*"),
  lover = c("lov* ones", "lov* boy", "lover*")
))
# tokenize first, then look up the phrases; this avoids relying on %>%,
# which may not be available unless magrittr is loaded
toks <- tokens("Some employees are hardworking employees in useful employment.
They support loved osuch as their wives and lovers.")
tokens_lookup(toks, dictionary = thedict, exclusive = FALSE, capkeys = FALSE)
## tokens from 1 document.
## text1 :
## [1] "Some" "employee" "are" "employee" "in" "useful" "employee"
## [8] "." "They" "support" "loved" "osuch" "as" "their"
## [15] "wives" "and" "lover" "."