I'm doing some text mining and putting together a simple list of the most common words. I'm removing stop words along with some other data cleaning. Here is the corpus, plus the matrix and the sorting:
library(tm)        # text-mining framework
library(magrittr)  # provides the %>% pipe

# Preliminary corpus
corpusNR <- Corpus(VectorSource(nat_registry)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument)

# Create document-term matrix & remove sparse terms
tdmNR <- DocumentTermMatrix(corpusNR) %>%
  removeSparseTerms(1 - (5/length(corpusNR)))
# Calculate and sort by word frequencies
word.freqNR <- sort(colSums(as.matrix(tdmNR)),
                    decreasing = TRUE)

# Create frequency table
tableNR <- data.frame(word = names(word.freqNR),
                      absolute.frequency = word.freqNR,
                      relative.frequency =
                        word.freqNR / length(word.freqNR))

# Remove the words from the row names
rownames(tableNR) <- NULL

# Show the 10 most common words
head(tableNR, 10)
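For a self-contained check, the same chain of steps can be run on a tiny toy corpus (the two sentences below are invented for illustration, not part of my data):

```r
library(tm)
library(magrittr)

# Two invented sentences standing in for nat_registry
toy <- c("The vaccine education program informs 100 communities!",
         "County health extension programs provide information.")

toy_corpus <- Corpus(VectorSource(toy)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(stemDocument)

# Inspect the processed text of each document
sapply(toy_corpus, content)
```

Even on this toy input, the processed documents come back with forms like "vaccin", "educ", and "communiti" instead of the full words.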
Here are the top ten words I end up with:
> head(tableNR, 10)
word absolute.frequency relative.frequency
1 vaccin 95 0.4822335
2 program 82 0.4162437
3 covid 59 0.2994924
4 health 59 0.2994924
5 educ 55 0.2791878
6 extens 53 0.2690355
7 inform 49 0.2487310
8 communiti 42 0.2131980
9 provid 41 0.2081218
10 counti 36 0.1827411
Note that many of the words have part of their ending cut off: #1 should be "vaccine", #5 should be "education", and so on.
Any idea why this is happening? Thanks in advance.
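One thing I noticed: the shortened forms look exactly like what SnowballC's Porter stemmer produces (I believe this is the backend that tm's stemDocument uses), which can be checked directly:

```r
library(SnowballC)  # stemming backend used by tm::stemDocument

# Stem the full words and compare with the table above
wordStem(c("vaccine", "education", "communities", "counties", "extension"))
# "vaccin" "educ" "communiti" "counti" "extens"
```

If stemming is the cause, dropping the tm_map(stemDocument) line would keep the full words, or tm::stemCompletion could map the stems back to complete forms using an unstemmed copy of the corpus as the dictionary.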