我的目标是随时间推移bump chart字频率。我有大约36000个用户评论和相关日期的个人条目。我在这里有25个用户样本:http://pastebin.com/kKfby5kf
我试图在给定日期获得最频繁的单词(可能是前10名?)。我觉得我的方法很接近,但不太正确:
library("tm")
frequencylist <- list(0)
for(i in unique(sampledf[,2])){
subset <- subset(sampledf, sampledf[,2]==i)
comments <- as.vector(subset[,1])
verbatims <- Corpus(VectorSource(comments))
verbatims <- tm_map(verbatims, stripWhitespace)
verbatims <- tm_map(verbatims, content_transformer(tolower))
verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
verbatims <- tm_map(verbatims, removePunctuation)
stopwords2 <- c("game")
verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
dtm <- DocumentTermMatrix(verbatims2)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
frequencydf <- data.frame(frequency)
frequencydf$comments <- row.names(frequencydf)
frequencydf$date <- i
frequencylist[[i]] <- frequencydf
}
解释我的疯狂:pastebin示例进入samplesf。对于样本中的每个唯一日期,我都试图获得单词频率。然后,我尝试将列表中的单词频率存储在列表中(尽管可能不是最好的方法)。首先,我按日期进行子集,然后删除空白,常用英语单词,标点符号和小写。然后我再次为#34;游戏&#34;因为它不是太有趣但很常见。为了获得单词频率,我将其传递到文档术语矩阵并执行简单的colSums()
。然后我附加该表的日期并尝试将其存储在列表中。
我不确定我的策略是否有效。有没有更简单,更好的方法解决这个问题?
答案 0 :(得分:0)
评论者是正确的,因为有更好的方法来设置可重现的示例。此外,您的答案可能更具体地在您尝试作为输出完成的内容中。 (我无法在没有错误的情况下执行您的代码。)
然而:你要求一种更简单,更好的方法。以下是我认为的两种情况。它使用 quanteda 文本包,并在创建文档特征矩阵时利用groups
功能。然后它在“dfm”上执行一些排名,以获得您在日常排名方面所需的内容。
请注意,这取决于我使用read.delim("sampledf.tsv", stringsAsFactors = FALSE)
加载了您的关联数据。
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip,
docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%M/%d/%Y")))
# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete.
## Elapsed time: 0.009 seconds.
myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types
# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
## â great nice play go will can get ever first
## 2013-10-02 1 18 19 20 21 22 23 24 25 26
## 2013-10-04 3 1 2 4 5 6 7 8 9 10
## 2013-10-05 3 9 28 29 1 2 4 5 6 7
## 2013-10-06 7 4 8 10 11 30 31 32 33 34
## 2013-10-07 5 1 2 3 4 6 7 8 9 10
## 2013-10-09 12 42 43 1 2 3 4 5 6 7
## 2013-10-13 1 14 6 9 10 13 44 45 46 47
## 2013-10-16 2 3 84 85 1 4 5 6 7 8
## 2013-10-18 15 1 2 3 4 5 6 7 8 9
## 2013-10-19 3 86 1 2 4 5 6 7 8 9
## 2013-10-22 2 87 88 89 90 91 92 93 94 95
## 2013-10-23 13 98 99 100 101 102 103 104 105 106
## 2013-10-25 4 6 5 12 16 109 110 111 112 113
## 2013-10-27 8 4 6 15 17 124 125 126 127 128
## 2013-10-30 11 1 2 3 4 5 6 7 8 9
## 2014-10-01 7 16 139 1 2 3 4 5 6 8
## 2014-10-02 140 1 2 3 4 5 6 7 8 9
## 2014-10-03 141 142 143 1 2 3 4 5 6 7
## 2014-10-05 144 145 146 147 148 1 2 3 4 5
## 2014-10-06 17 149 150 1 2 3 4 5 6 7
# top n features by day
n <- 10
as.data.frame(apply(featureRanksByDate, 1, function(x) {
todaysTopFeatures <- names(featureRanksByDate)
names(todaysTopFeatures) <- x
todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
## 2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1 â great go triple great play â go great nice year year
## 2 win nice will niple nice go created â nice play â give
## 3 year â â backflip play will wasnt great play â give good
## 4 give play can great go can money will go go good hard
## 5 good go get scope â get prizes can will will hard time
## 6 hard will ever ball will ever nice get can can time triple
## 7 time can first â can first piece ever get get triple niple
## 8 triple get fun nice get fun dead first ever ever niple backflip
## 9 niple ever great testical ever win play fun first first backflip scope
## 10 backflip first win play first year go win fun fun scope ball
## 2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1 scope scope great play great play will play
## 3 testical testical play will play will get will
## 2 ball ball nice go nice go can go
## 4 â great go can go can ever can
## 5 nice shot will get will get first get
## 6 great nice can ever can ever fun ever
## 7 shot head get â get first win first
## 8 head â ever first ever fun year fun
## 9 dancing dancing first fun first win give win
## 10 cow cow fun win fun year good year
BTW有趣的 niple 和 testical 的拼写。