Question

I am trying to get the DTM (document-term matrix) of the words on this page:

https://en.wikipedia.org/wiki/Talk:Libyan_Civil_War_(2011)/Archive_1

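My problem is that the usernames (pseudonyms) of the people who posted, which are words in my corpus, never appear in my DTM, even though I set dictionary to NULL. For example, I expect "Lihaas" to be found 31 times, but it does not appear in my DTM at all.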

My code:

library(tm)
docs<- VCorpus(DirSource(directory = "~dir"))
docsTDM <- DocumentTermMatrix(docs, control=list(dictionary=NULL))

I get:

          the          2011      february           utc 
          628           319           293           280 
         talk           and          this          that 
          236           197           163           152 
          for           are           not      uprising 
          106           101            92            79 
       libyan      protests           but       support 
           76            75            68            68 
         with         there        revolt           its 
           68            65            62            61 
      protest       article          have           now 
           58            57            53            50 
          has         civil        should         which 
           47            46            44            44 
         more         think           war           was 
           43            43            41            41 
         from         libya          what         would 
           40            40            36            35 
        about    revolution         added       sources 
           34            34            32            32 
      comment    government        people          some 
           30            30            30            30 
          all          just       section           you 
           29            29            29            29 
         than      unsigned          will           can 
           27            27            27            26 
talk•contribs          then          even          name 
           26            26            25            25

Answer 1

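This probably has to do with the fact that, in all the cases I looked at, "Lihaas" is either directly adjacent to a preceding "." or enclosed in parentheses. So it is most likely a problem with tm's tokeniser.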
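To illustrate the idea, here is a minimal base R sketch. The sample string is made up to imitate talk-page signatures (it is not taken from the actual page), and the whitespace split is only a rough stand-in for a simple tokeniser, not tm's actual internals.

```r
# Hypothetical text imitating talk-page signatures (an assumption, not real page content)
txt <- "Support.Lihaas (talk) 10:00 UTC. I agree (Lihaas) and so does Lihaas."

# Split on whitespace only: the name stays glued to ".", "(" and ")"
raw_tokens <- strsplit(txt, "[[:space:]]+")[[1]]
n_raw <- sum(raw_tokens == "Lihaas")

# Split on every run of non-alphanumeric characters instead
clean_tokens <- strsplit(txt, "[^[:alnum:]]+")[[1]]
clean_tokens <- clean_tokens[nzchar(clean_tokens)]  # drop empty strings, if any
n_clean <- sum(clean_tokens == "Lihaas")

# Compare how often the exact token "Lihaas" is found in each case
c(whitespace_only = n_raw, punctuation_split = n_clean)
```

With whitespace-only splitting, tokens such as "(Lihaas)" or "Lihaas." are counted as different terms from "Lihaas", which would explain a missing or undercounted name.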

Here is an alternative using the quanteda package that produces what you want.

# read the document using the readtext package
wikitxt <- readtext::readtext("Talk:Libyan Civil War (2011):Archive 1 - Wikipedia.html")

library("quanteda")
wikidfm <- dfm(corpus(wikitxt), tolower = FALSE)
wikidfm
## Document-feature matrix of: 1 document, 3,006 features (0% sparse).

wikidfm[, c("lihaas", "Lihaas")]
## Document-feature matrix of: 1 document, 2 features (0% sparse).
## 1 x 2 sparse Matrix of class "dfm"
##        features
## docs    lihaas Lihaas
##   text1      1     30
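A note for newer quanteda releases: as far as I know, from version 3 onwards dfm() expects a tokens object rather than a corpus, so the call above may need an explicit tokens() step. A minimal sketch under that assumption, using a made-up sample string instead of the real page:

```r
library("quanteda")

# Hypothetical sample text (an assumption, not taken from the actual page)
txt <- c(text1 = "I agree (Lihaas) and so does Lihaas.")

# Tokenise first, dropping punctuation so "(Lihaas)" and "Lihaas." become "Lihaas"
toks <- tokens(corpus(txt), remove_punct = TRUE)

# Build the document-feature matrix without lowercasing, to keep "Lihaas" as typed
sampledfm <- dfm(toks, tolower = FALSE)

# Look up the count for the username
sampledfm[, "Lihaas"]
```

Keeping tolower = FALSE is a choice rather than a requirement. With the default lowercasing, the capitalised and lowercase spellings would be merged into a single "lihaas" feature.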
