除了问题R Text mining - how to change texts in R data frame column into several columns with word frequencies?之外,我想知道如何设法使用bigrams频率而不仅仅是字频。 再次,非常感谢提前!
这是示例数据框(感谢Tyler Rinker)。
person sex adult state code
1 sam m 0 Computer is fun. Not too fun. K1
2 greg m 0 No it's not, it's dumb. K2
3 teacher m 1 What should we do? K3
4 sam m 0 You liar, it stinks! K4
5 greg m 0 I am telling the truth! K5
6 sally f 0 How can we be certain? K6
7 greg m 0 There is no way. K7
8 sam m 0 I distrust you. K8
9 sally f 0 What are you talking about? K9
10 researcher f 1 Shall we move on? Good then. K10
11 greg m 0 I'm hungry. Let's eat. You already? K11
上面的数据集:
library(qdap); DATA
答案 0 :(得分:3)
qdap
的开发版本(应该在接下来的几天内转到CRAN)会生成ngrams。现在,您需要使用dev version。在玩具数据集上,这很快,但在较大的数据集上,例如qdap
的{{1}}数据集需要约5分钟才能完成。你可以:
以下是获取mraja1
开发版并运行二元搜索的代码:
qdap