用户在R中随时间变化的频率

时间:2015-10-22 02:11:50

标签: r tm

我的目标是随时间推移bump chart字频率。我有大约36000个用户评论和相关日期的个人条目。我在这里有25个用户样本:http://pastebin.com/kKfby5kf

我试图在给定日期获得最频繁的单词(可能是前10名?)。我觉得我的方法很接近,但不太正确:

    library("tm")

frequencylist <- list(0)

for(i in unique(sampledf[,2])){

  subset <- subset(sampledf, sampledf[,2]==i)

  comments <- as.vector(subset[,1])
  verbatims <- Corpus(VectorSource(comments))
  verbatims <- tm_map(verbatims, stripWhitespace)
  verbatims <- tm_map(verbatims, content_transformer(tolower))
  verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
  verbatims <- tm_map(verbatims, removePunctuation)

  stopwords2 <- c("game")
  verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
  dtm <- DocumentTermMatrix(verbatims2)
  dtm2 <- as.matrix(dtm)
  frequency <- colSums(dtm2)
  frequency <- sort(frequency, decreasing=TRUE)
  frequencydf <- data.frame(frequency)
  frequencydf$comments <- row.names(frequencydf)
  frequencydf$date <- i

  frequencylist[[i]] <- frequencydf 
}

解释我的疯狂:pastebin示例进入samplesf。对于样本中的每个唯一日期,我都试图获得单词频率。然后,我尝试将列表中的单词频率存储在列表中(尽管可能不是最好的方法)。首先,我按日期进行子集,然后删除空白,常用英语单词,标点符号和小写。然后我再次为#34;游戏&#34;因为它不是太有趣但很常见。为了获得单词频率,我将其传递到文档术语矩阵并执行简单的colSums()。然后我附加该表的日期并尝试将其存储在列表中。

我不确定我的策略是否有效。有没有更简单,更好的方法解决这个问题?

1 个答案:

答案 0 :(得分:0)

评论者是正确的,因为有更好的方法来设置可重现的示例。此外,您的答案可能更具体地在您尝试作为输出完成的内容中。 (我无法在没有错误的情况下执行您的代码。)

然而:你要求一种更简单,更好的方法。以下是我认为的两种情况。它使用 quanteda 文本包,并在创建文档特征矩阵时利用groups功能。然后它在“dfm”上执行一些排名,以获得您在日常排名方面所需的内容。

请注意,这取决于我使用read.delim("sampledf.tsv", stringsAsFactors = FALSE)加载了您的关联数据。

require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip, 
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%M/%d/%Y")))

# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete. 
## Elapsed time: 0.009 seconds.

myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types

# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
##              â great nice play  go will can get ever first
## 2013-10-02   1    18   19   20  21   22  23  24   25    26
## 2013-10-04   3     1    2    4   5    6   7   8    9    10
## 2013-10-05   3     9   28   29   1    2   4   5    6     7
## 2013-10-06   7     4    8   10  11   30  31  32   33    34
## 2013-10-07   5     1    2    3   4    6   7   8    9    10
## 2013-10-09  12    42   43    1   2    3   4   5    6     7
## 2013-10-13   1    14    6    9  10   13  44  45   46    47
## 2013-10-16   2     3   84   85   1    4   5   6    7     8
## 2013-10-18  15     1    2    3   4    5   6   7    8     9
## 2013-10-19   3    86    1    2   4    5   6   7    8     9
## 2013-10-22   2    87   88   89  90   91  92  93   94    95
## 2013-10-23  13    98   99  100 101  102 103 104  105   106
## 2013-10-25   4     6    5   12  16  109 110 111  112   113
## 2013-10-27   8     4    6   15  17  124 125 126  127   128
## 2013-10-30  11     1    2    3   4    5   6   7    8     9
## 2014-10-01   7    16  139    1   2    3   4   5    6     8
## 2014-10-02 140     1    2    3   4    5   6   7    8     9
## 2014-10-03 141   142  143    1   2    3   4   5    6     7
## 2014-10-05 144   145  146  147 148    1   2   3    4     5
## 2014-10-06  17   149  150    1   2    3   4   5    6     7

# top n features by day
n <- 10 
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
##    2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1           â      great         go     triple      great       play          â         go      great       nice       year       year
## 2         win       nice       will      niple       nice         go    created          â       nice       play          â       give
## 3        year          â          â   backflip       play       will      wasnt      great       play          â       give       good
## 4        give       play        can      great         go        can      money       will         go         go       good       hard
## 5        good         go        get      scope          â        get     prizes        can       will       will       hard       time
## 6        hard       will       ever       ball       will       ever       nice        get        can        can       time     triple
## 7        time        can      first          â        can      first      piece       ever        get        get     triple      niple
## 8      triple        get        fun       nice        get        fun       dead      first       ever       ever      niple   backflip
## 9       niple       ever      great   testical       ever        win       play        fun      first      first   backflip      scope
## 10   backflip      first        win       play      first       year         go        win        fun        fun      scope       ball
##    2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1       scope      scope      great       play      great       play       will       play
## 3    testical   testical       play       will       play       will        get       will
## 2        ball       ball       nice         go       nice         go        can         go
## 4           â      great         go        can         go        can       ever        can
## 5        nice       shot       will        get       will        get      first        get
## 6       great       nice        can       ever        can       ever        fun       ever
## 7        shot       head        get          â        get      first        win      first
## 8        head          â       ever      first       ever        fun       year        fun
## 9     dancing    dancing      first        fun      first        win       give        win
## 10        cow        cow        fun        win        fun       year       good       year

BTW有趣的 niple testical 的拼写。