I am trying to build a document-term matrix (DTM) of the words on this page:
https://en.wikipedia.org/wiki/Talk:Libyan_Civil_War_(2011)/Archive_1
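My problem is that the usernames of the people who posted, which are words in my corpus, never appear in my DTM, even though I set dictionary to NULL. For example, I expect "Lihaas" to be found 31 times, but it does not appear in my DTM at all.
My code: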
library(tm)
docs<- VCorpus(DirSource(directory = "~dir"))
docsTDM <- DocumentTermMatrix(docs, control=list(dictionary=NULL))
I get:
          the          2011      february           utc
          628           319           293           280
         talk           and          this          that
          236           197           163           152
          for           are           not      uprising
          106           101            92            79
       libyan      protests           but       support
           76            75            68            68
         with         there        revolt           its
           68            65            62            61
      protest       article          have           now
           58            57            53            50
          has         civil        should         which
           47            46            44            44
         more         think           war           was
           43            43            41            41
         from         libya          what         would
           40            40            36            35
        about    revolution         added       sources
           34            34            32            32
      comment    government        people          some
           30            30            30            30
          all          just       section           you
           29            29            29            29
         than      unsigned          will           can
           27            27            27            26
talk•contribs          then          even          name
           26            26            25            25
Answer 0 (score: 0)
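This probably has to do with the fact that "Lihaas" is either directly adjacent to a preceding "." or inside parentheses in every instance I looked at, so it never appears as a standalone token. The most likely cause is therefore how tm's tokeniser splits the text.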
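To illustrate the suspected cause, here is a minimal base R sketch. It is not tm's actual tokeniser: it simply assumes tokens are split on whitespace only, and the sample sentence is invented for illustration rather than copied from the page.

```r
# Hypothetical sample text: the username directly follows a full stop,
# or sits inside parentheses.
txt <- "I think this should be renamed.Lihaas (talk) and again (Lihaas) said so."

# Assumption: a whitespace-only split, similar in spirit to a simple tokeniser.
ws_tokens <- unlist(strsplit(txt, "[[:space:]]+"))

# Splitting on every run of non-alphanumeric characters separates the name.
alnum_tokens <- unlist(strsplit(txt, "[^[:alnum:]]+"))

# Count exact matches under each splitting rule.
ws_count <- sum(ws_tokens == "Lihaas")
alnum_count <- sum(alnum_tokens == "Lihaas")
print(c(whitespace = ws_count, alphanumeric = alnum_count))
```

With the whitespace split, the name only exists glued to its neighbours (for example "renamed.Lihaas" or "(Lihaas)"), so an exact lookup finds nothing.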
Here is an alternative that uses the quanteda package to produce what you want.
# read the document using the readtext package
wikitxt <- readtext::readtext("Talk:Libyan Civil War (2011):Archive 1 - Wikipedia.html")
library("quanteda")
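# tokenise first (recent quanteda versions no longer accept a corpus in dfm()), keeping case
wikidfm <- dfm(tokens(corpus(wikitxt)), tolower = FALSE)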
wikidfm
## Document-feature matrix of: 1 document, 3,006 features (0% sparse).
wikidfm[, c("lihaas", "Lihaas")]
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
## features
## docs lihaas Lihaas
## text1 1 30
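If you would rather stay with tm, one possible workaround is to pass a custom tokeniser through the `tokenize` control option of `DocumentTermMatrix()`. This is only a sketch and has not been run against the page in question: `alnum_tokenizer` is a hypothetical helper name, the one-line corpus is invented, and the tm part only runs if tm is installed.

```r
# Hypothetical helper: split on any run of non-alphanumeric characters.
alnum_tokenizer <- function(x) {
  tokens <- unlist(strsplit(as.character(x), "[^[:alnum:]]+"))
  tokens[nzchar(tokens)]  # drop empty strings left by leading punctuation
}

# Quick check of the helper on an invented string.
sample_tokens <- alnum_tokenizer("(Lihaas) wrote this.Lihaas (talk)")
print(sample_tokens)

if (requireNamespace("tm", quietly = TRUE)) {
  # Invented one-document corpus standing in for the real page.
  docs <- tm::VCorpus(tm::VectorSource("(Lihaas) wrote this.Lihaas (talk)"))
  dtm <- tm::DocumentTermMatrix(
    docs,
    control = list(tokenize = alnum_tokenizer, tolower = FALSE)
  )
  tm::inspect(dtm)
}
```

Keeping `tolower = FALSE` mirrors the quanteda call above, so "Lihaas" and "lihaas" are counted separately.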