I know I can use the tm package to count the occurrences of specific words in a corpus:
require(tm)
data(crude)
dic <- Dictionary("crude")
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE))
inspect(tdm)
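Is there a facility for supplying regular expressions to Dictionary instead of fixed words?
Sometimes stemming is not what I want (for example, I might want to catch misspellings), so I would like to do something like this:
dic <- Dictionary(c("crude",
                    "\\bcrud[[:alnum:]]+",
                    "\\bcrud[de]"))
Is that possible while still using the facilities of the tm package?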
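Answer 0 (score: 3)
I'm not sure you can put regular expressions in the Dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to use a regular expression to subset the terms of the term-document matrix, and then do the word count: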
# What I would do instead
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))
# subset the tdm according to the criteria
# this is where you can use regex
crit <- grep("cru", tdm$dimnames$Terms)
# have a look to see what you got
inspect(tdm[crit])
A term-document matrix (2 terms, 20 documents)
Non-/sparse entries: 10/30
Sparsity : 75%
Maximal term length: 7
Weighting : term frequency (tf)
Docs
Terms 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543
crucial 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
crude 2 0 2 3 0 2 0 0 0 0 5 2 0 2 0 0 0 2
Docs
Terms 704 708
crucial 0 0
crude 0 1
# and count the number of times that criteria is met in each doc
colSums(as.matrix(tdm[crit]))
127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
2 0 2 3 0 2 2 0 0 0 5 2 0 2 0 0 0 2 0 1
# count the total number of times in all docs
sum(colSums(as.matrix(tdm[crit])))
[1] 23
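If this isn't what you're after, please edit your question to include some sample data that properly represents your actual use case, along with an example of your desired output.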
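The same subset-then-count idea can be sketched in base R, without tm, to show how an anchored pattern catches misspellings while leaving out unrelated terms such as "crucial". The documents, the variable names, and the pattern `^crud[[:alnum:]]*$` are made up for illustration and are not taken from the crude corpus.

```r
# Toy documents (hypothetical, not from the crude corpus)
docs <- c(d1 = "Crude oil prices rose as crude output fell",
          d2 = "A crucial meeting on crud oil quotas",
          d3 = "Analysts expect crudee exports to drop")

# Tokenise: lower-case, then split on anything that is not a letter or digit
tokens <- strsplit(tolower(docs), "[^[:alnum:]]+")

# Build a term-by-document count matrix (rows = terms, columns = documents)
terms <- sort(unique(unlist(tokens)))
counts <- sapply(tokens, function(tok) table(factor(tok, levels = terms)))

# Anchored pattern: "crud" followed by zero or more letters or digits,
# so "crud", "crude" and "crudee" match but "crucial" does not
hits <- grepl("^crud[[:alnum:]]*$", rownames(counts))

# Per-document and overall counts of the matching terms
per_doc <- colSums(counts[hits, , drop = FALSE])
total <- sum(per_doc)
```

With a real term-document matrix the pattern would be applied to the term names in the same way; anchoring with `^` and `$` is what keeps a loose pattern like "cru" from also picking up "crucial".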
Answer 1 (score: 2)
The text analysis package quanteda lets you select features with a regular expression if you specify valuetype = "regex".
require(tm)
require(quanteda)
data(crude)
dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE)
# Document-feature matrix of: 20 documents, 2 features.
# 20 x 2 sparse Matrix of class "dfmSparse"
# features
# docs crude crucial
# 127 2 0
# 144 0 0
# 191 2 0
# 194 3 0
# 211 0 0
# 236 2 0
# 237 0 2
# 242 0 0
# 246 0 0
# 248 0 0
# 273 5 0
# 349 2 0
# 352 0 0
# 353 2 0
# 368 0 0
# 489 0 0
# 502 0 0
# 543 2 0
# 704 0 0
# 708 1 0
See also ?selectFeatures.
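The call above uses the argument names from the quanteda release current when this answer was written. As a rough sketch, and assuming a more recent quanteda (version 3 or later, where tokenisation is a separate step and feature selection is done with `dfm_select()`), the same regex selection might look like this. The toy texts are hypothetical stand-ins for the crude corpus.

```r
library(quanteda)

# Toy documents (hypothetical); a real corpus would go through the same steps
txt <- c(d1 = "Crude oil prices rose as crude output fell",
         d2 = "A crucial meeting on oil quotas")

# Tokenise first, dropping punctuation
toks <- tokens(txt, remove_punct = TRUE)

# Build the document-feature matrix (features are lower-cased by default)
dfmat <- dfm(toks)

# Keep only the features whose name matches the regular expression
selected <- dfm_select(dfmat, pattern = "^cru",
                       selection = "keep", valuetype = "regex")

# Inspect which features survived the selection
featnames(selected)
```

Check the documentation of the installed version (`?dfm_select`) before relying on these names, since the quanteda API has changed between releases.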