I am new to R and also new to regular expressions. I found this in other discussions, but could not find a good match for my problem.
I have a large text dataset (a book). I use the following code to extract the words of the text:
> a <- gregexpr("[a-zA-Z0-9'\\-]+", book[1])
> regmatches(book[1], a)
[[1]]
[1] "she" "runs"
I now want to split all of the text in the entire dataset (the book) into individual words, so that I can identify the ten most frequent words in the whole text. I expect I need to count the words with the table function and then sort them somehow to get the top ten.
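A minimal sketch of that plan, extending the gregexpr call above to the whole book (assuming book is a character vector of lines):

# extract every word from every element of book at once
a <- gregexpr("[a-zA-Z0-9'\\-]+", book)
words <- unlist(regmatches(book, a))

# count the words and sort to get the ten most frequent
sort(table(words), decreasing = TRUE)[1:10]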
In addition, does anyone have ideas on how to compute the cumulative distribution, i.e. how many distinct words are needed to cover half (50%) of all the word occurrences in the text?
Thank you very much for your replies, and for your patience with my basic questions.
Answer 0 (score: 5)
Not a regex approach, but it may be more what you're after with less fuss... Here is the qdap approach, using Thomas's data (P.S. nice approach to getting the data):
u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))
library(qdap)
freq_terms(book, 10)
## WORD FREQ
## 1 the 18195
## 2 of 12015
## 3 to 7177
## 4 and 5191
## 5 in 4518
## 6 a 4051
## 7 be 3846
## 8 that 2800
## 9 it 2565
## 10 is 2218
The nice thing about this is that you get control over:
stopwords
at.least
extend = TRUE (the default)
Here it is again, this time using stopwords and a minimum word length (often these two arguments overlap, since stopwords tend to be the shortest words), plus a plot:
(ft <- freq_terms(book, 10, at.least=3, stopwords=qdapDictionaries::Top25Words))
plot(ft)
## WORD FREQ
## 1 which 2075
## 2 would 1273
## 3 will 1257
## 4 not 1238
## 5 their 1098
## 6 states 864
## 7 may 839
## 8 government 830
## 9 been 798
## 10 state 792
Answer 1 (score: 4)
To get the word frequencies:
> mytext = c("This","is","a","test","for","count","of","the","words","The","words","have","been","written","very","randomly","so","that","the","test","can","be","for","checking","the","count")
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that The This very
3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
written
1
Ignoring case:
> mytext = tolower(mytext)
>
> sort(table(mytext), decreasing=T)
mytext
the count for test words a be been can checking have is of randomly so that this very written
4 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
Only the top ten words:
> sort(table(mytext), decreasing=T)[1:10]
mytext
the count for test words a be been can checking
4 2 2 2 2 1 1 1 1 1
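This sorted table also makes the cumulative-distribution part of the question straightforward. A minimal sketch, assuming the lower-cased mytext from above, that finds how many distinct words are needed to cover half (50%) of all word occurrences:

# sorted frequencies, most frequent first
freqs <- sort(table(mytext), decreasing = TRUE)

# cumulative share of all word occurrences
cum_share <- cumsum(freqs) / sum(freqs)

# index of the first word at which coverage reaches 50%
which(cum_share >= 0.5)[1]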
Answer 2 (score: 4)
You could use regular expressions, but a text-mining package will give you much more flexibility. For example, to do a basic word split, simply:
u <- "http://www.gutenberg.org/cache/epub/1404/pg1404.txt"
library("httr")
book <- httr::content(GET(u))
w <- strsplit(book, "[[:space:]]+")[[1]]
tail(sort(table(w)), 10)
# w
# which is that be a in and to of the
# 1968 1995 2690 3766 3881 4184 4943 6905 11896 16726
However, if you want, for example, to be able to remove common stopwords or handle capitalization better (above, Hello and hello would not be counted together), you should dig into tm:
library("tm")
s <- URISource(u)
corpus <- VCorpus(s)
m <- DocumentTermMatrix(corpus)
findFreqTerms(m, 600) # words appearing at least 600 times
# "all" "and" "are" "been" "but" "for" "from" "have" "its" "may"
# "not" "that" "the" "their" "they" "this" "which" "will" "with" "would"
c2 <- tm_map(corpus, removeWords, stopwords("english"))
m2 <- DocumentTermMatrix(c2)
findFreqTerms(m2, 400) # words appearing at least 400 times
# [1] "can" "government" "may" "must" "one" "power" "state" "the" "will"