I am trying to use R's tm package to process my text data.
My data corpus currently takes the following form:
1. The sports team practiced today
2. The soccer team went took the day off
The data would then be vectorized as:
<the, sports, team, practiced, today, soccer, went, took, off>
1. <1, 1, 1, 1, 1, 0, 0, 0, 0>
2. <1, 0, 1, 0, 0, 1, 1, 1, 1>
I would prefer to use a custom set of phrases for my vectors, for example:
<sports team, soccer team, practiced today, day off>
1. <1, 0, 1, 0>
2. <0, 1, 0, 1>
Is there a package or function in R that can do this? Or is there some other open-source resource with similar functionality? Thanks.
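For context, the default single-word vectorization shown above can be reproduced with tm's DocumentTermMatrix. A minimal sketch (assuming tm is installed; wordLengths is widened so short words like "the" are kept, since tm drops words under 3 characters by default):

```r
library(tm)

docs <- c("The sports team practiced today",
          "The soccer team went took the day off")
corpus <- VCorpus(VectorSource(docs))

# Default unigram tokenization: every single word becomes a feature
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
inspect(dtm)
```

This is the behavior to be replaced: each word is its own feature, with no notion of multi-word phrases.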
Answer 0 (score: 2)
You asked about other text packages, so you are welcome to try quanteda, developed with Paul Nulty.

In the code below, you first define the multi-word phrases you want as a named list, typed as a quanteda "dictionary" class object using the dictionary() constructor. phrasetotoken() then converts the phrases in your texts into single "tokens" consisting of the phrases joined by underscores. Because the tokenizer ignores the underscores, your phrases are treated as single-word tokens. dfm(), the constructor for a document-feature matrix, can take a regular expression defining which features to keep; here it keeps any feature containing an underscore character (the regular expression could of course be improved, but I have kept it deliberately simple here). dfm() has many options; see ?dfm.

install.packages("quanteda")
library(quanteda)

mytext <- c("The sports team practiced today",
            "The soccer team went took the day off")
myphrases <- dictionary(list(myphrases = c("sports team", "soccer team", "practiced today", "day off")))

mytext2 <- phrasetotoken(mytext, myphrases)
mytext2
## [1] "The sports_team practiced_today"       "The soccer_team went took the day_off"

# keptFeatures is a regular expression: keep only the underscore-joined phrases
mydfm <- dfm(mytext2, keptFeatures = "_", verbose = FALSE)
mydfm
## Document-feature matrix of: 2 documents, 4 features.
## 2 x 4 sparse Matrix of class "dfmSparse"
##        features
## docs    day_off practiced_today soccer_team sports_team
##   text1       0               1           0           1
##   text2       1               0           1           0

I would be happy to address any related questions, including feature requests if you can suggest improvements to the phrase handling.
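Note that in current quanteda releases the phrasetotoken() function and the keptFeatures argument have been removed. A rough modern equivalent (a sketch, assuming quanteda >= 1.0) compounds the phrases at the tokens stage instead:

```r
library(quanteda)

mytext <- c("The sports team practiced today",
            "The soccer team went took the day off")
myphrases <- c("sports team", "soccer team", "practiced today", "day off")

# tokens_compound() joins each matched multi-word phrase into a single
# underscore-joined token, e.g. "sports team" -> "sports_team"
toks <- tokens_compound(tokens(mytext), pattern = phrase(myphrases))

# keep only the compounded (underscore-containing) features, as before
mydfm <- dfm_select(dfm(toks), pattern = "_", valuetype = "regex")
mydfm
```

The result is the same four-feature matrix as above, built from the tokens object rather than from a pre-edited character vector.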
Answer 1 (score: 0)
How about something like this?
library(tm)
library(stringr)

text <- c("The sports team practiced today", "The soccer team went took the day off")
corpus <- Corpus(VectorSource(text))

tokenizing.phrases <- c("sports team", "soccer team", "practiced today", "day off")

phraseTokenizer <- function(x) {
  x <- as.character(x)  # extract the plain text from the tm TextDocument object
  x <- str_trim(x)
  if (is.na(x)) return("")

  # note: ignore.case() was removed from stringr; in stringr >= 1.0 use
  # fixed(..., ignore_case = TRUE) instead
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))

  if (any(phrase.hits)) {
    # only split once, on the first hit, so we don't have to worry about
    # multiple occurrences of the same phrase
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    # warning(paste("split phrase:", split.phrase))
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    # recursive: phraseTokenizer() calls itself on both halves of the split
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)
  }

  # drop any extraneous empty strings, which can occur when a phrase
  # appears just before punctuation
  out[out != ""]
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))

> Terms(tdm)
[1] "day off"         "practiced today" "soccer team"     "sports team"     "the"             "took"
[7] "went"
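If, as the question asks, only the custom phrases should survive as features, the term-document matrix can be subset afterwards. A small sketch, reusing tdm and tokenizing.phrases from the answer above:

```r
library(tm)
library(stringr)

# --- same setup as the answer above ---
text <- c("The sports team practiced today", "The soccer team went took the day off")
corpus <- Corpus(VectorSource(text))
tokenizing.phrases <- c("sports team", "soccer team", "practiced today", "day off")

phraseTokenizer <- function(x) {
  x <- as.character(x)
  x <- str_trim(x)
  if (is.na(x)) return("")
  phrase.hits <- str_detect(x, fixed(tokenizing.phrases, ignore_case = TRUE))
  if (any(phrase.hits)) {
    split.phrase <- tokenizing.phrases[which(phrase.hits)[1]]
    temp <- unlist(str_split(x, fixed(split.phrase, ignore_case = TRUE), 2))
    out <- c(phraseTokenizer(temp[1]), split.phrase, phraseTokenizer(temp[2]))
  } else {
    out <- MC_tokenizer(x)
  }
  out[out != ""]
}

tdm <- TermDocumentMatrix(corpus, control = list(tokenize = phraseTokenizer))

# --- keep only rows whose term is one of the custom phrases ---
phrase.tdm <- tdm[Terms(tdm) %in% tokenizing.phrases, ]
inspect(phrase.tdm)
```

This leaves a 4 x 2 matrix over just "day off", "practiced today", "soccer team", and "sports team", matching the custom-phrase vectors the question asked for.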