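How can I use regular expressions with a TermDocumentMatrix for text mining?

Time: 2013-08-22 14:18:21

Tags: regex r text-mining tm

I know I can use the tm package to count the occurrences of specific words in a corpus: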

require(tm)
data(crude)

dic <- Dictionary("crude")
tdm <- TermDocumentMatrix(crude, control = list(dictionary = dic, removePunctuation = TRUE))
inspect(tdm)

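Is there a facility for supplying regular expressions to Dictionary instead of fixed words?

Sometimes stemming may not be what I want (for example, I might want to catch misspellings), so I would like to do something like this: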

dic <- Dictionary(c("crude",
                    "\\bcrud[[:alnum:]]+",
                    "\\bcrud[de]"))

and so keep using the facilities of the tm package?

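2 Answers:

Answer 0 (score: 3)

I'm not sure you can put regular expressions in the Dictionary function, since it only accepts a character vector or a term-document matrix. The workaround I'd suggest is to use a regular expression to subset the terms in the term-document matrix, and then do the word count: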

# What I would do instead
tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE))
# subset the tdm according to the criteria
# this is where you can use regex
crit <- grep("cru", tdm$dimnames$Terms)
# have a look to see what you got
# (index rows explicitly: terms are the rows of a tdm)
inspect(tdm[crit, ])
# A term-document matrix (2 terms, 20 documents)
#
# Non-/sparse entries: 10/30
# Sparsity           : 75%
# Maximal term length: 7
# Weighting          : term frequency (tf)
#
#          Docs
# Terms     127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543
#   crucial   0   0   0   0   0   0   2   0   0   0   0   0   0   0   0   0   0   0
#   crude     2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2
#          Docs
# Terms     704 708
#   crucial   0   0
#   crude     0   1
# and count the number of times that criteria is met in each doc
colSums(as.matrix(tdm[crit, ]))
# 127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
#   2   0   2   3   0   2   2   0   0   0   5   2   0   2   0   0   0   2   0   1
# count the total number of times in all docs
sum(colSums(as.matrix(tdm[crit, ])))
# [1] 23

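If this isn't what you're after, please go ahead and edit your question to include some example data that properly represents your actual use case, along with an example of your desired output.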
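The same idea extends to several patterns at once, which is closer to the list of expressions in the question. The following is only a minimal base-R sketch: the helper `count_by_regex` and the toy matrix (including the made-up misspelling "crudde") are hypothetical and stand in for `as.matrix(tdm)`. Because each term in the matrix is already a single token, the patterns are anchored with `^` and `$` rather than `\\b`.

```r
# Toy term-by-document count matrix standing in for as.matrix(tdm).
# Terms and counts are invented purely for illustration.
m <- matrix(
  c(2, 0, 1,
    0, 2, 0,
    1, 0, 0,
    0, 3, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(Terms = c("crude", "crucial", "crudde", "oil"),
                  Docs  = c("d1", "d2", "d3"))
)

# Hypothetical helper: for each regex, sum the counts of every matching
# term within each document. Returns a patterns-by-documents matrix.
count_by_regex <- function(m, patterns) {
  rows <- lapply(patterns, function(p) {
    hits <- grepl(p, rownames(m))
    # drop = FALSE keeps a matrix even when zero or one term matches
    colSums(m[hits, , drop = FALSE])
  })
  out <- do.call(rbind, rows)
  rownames(out) <- patterns
  out
}

patterns <- c("^crude$", "^crud+e$", "^cru")
counts <- count_by_regex(m, patterns)
print(counts)
```

With a real corpus one would presumably pass `as.matrix(tdm)` instead of `m`; note that converting a large term-document matrix to a dense matrix can use a lot of memory, so subsetting the rows first (as in the answer above) may be preferable.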

Answer 1 (score: 2)

The text analysis package quanteda allows features to be selected with regular expressions if you specify valuetype = "regex".

require(tm)
require(quanteda)
data(crude)

dfm(corpus(crude), keptFeatures = "^cru", valuetype = "regex", verbose = FALSE)
# Document-feature matrix of: 20 documents, 2 features.
# 20 x 2 sparse Matrix of class "dfmSparse"
#      features
# docs  crude crucial
#   127     2       0
#   144     0       0
#   191     2       0
#   194     3       0
#   211     0       0
#   236     2       0
#   237     0       2
#   242     0       0
#   246     0       0
#   248     0       0
#   273     5       0
#   349     2       0
#   352     0       0
#   353     2       0
#   368     0       0
#   489     0       0
#   502     0       0
#   543     2       0
#   704     0       0
#   708     1       0

See also ?selectFeatures
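The call above reflects the quanteda interface at the time this answer was written. In more recent quanteda releases, as far as I am aware, feature selection is done with `dfm_select()` on a dfm built from tokens. The sketch below assumes such a recent version is installed and uses two made-up sentences rather than the crude corpus, so it does not reproduce the output above.

```r
library(quanteda)

# Two invented documents, used only to illustrate the selection step.
txt <- c(d1 = "Crude oil prices rose; crude output is crucial.",
         d2 = "No matching terms here.")

# Tokenise, dropping punctuation, then build a document-feature matrix
# (dfm() lowercases features by default).
toks  <- tokens(txt, remove_punct = TRUE)
dfmat <- dfm(toks)

# Keep only the features whose name matches the regular expression.
sel <- dfm_select(dfmat, pattern = "^cru",
                  selection = "keep", valuetype = "regex")

print(featnames(sel))
print(sel)
```

The `valuetype = "regex"` argument plays the same role as in the original call; only the way the dfm is constructed and filtered differs.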