Question

I am using R to read in some text. A passage consists of 100 sentences, which I have put into a list. The list looks like this:

[[1]]

[1] "WigWagCo: For #TBT here's a video of Travis McCollum (Co-Founder and COO of WigWag) at #SXSW2016

[[2]]

[1] "chrisreedfilm: RT @hammertonail: #SXSW2016 doc THE SEER: A PORTRAIT OF WENDELL BERRY gets reviewed by @chrisreedfilm 

[[3]]

[1] "iamscottrandell: RT @therevue: Take a jaunt down #MemoriesofSXSW &amp; read the stories of @JRNelsonMusic @thegillsmusic &amp; @TheBlancosMusic 
...
...

[[99]]

[1] "SunPowerTalent: SunPower #Clerical #Job: Supply Chain Intern (#Austin, TX) 

[[100]]

[1] "SunPowerTalent: #Finance #Job alert: General Ledger Accountant | SunPower

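Each object in the list is a "sentence" from the same text. How can I count the frequency of every 3-gram in this text and also know which sentence each 3-gram comes from?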

Thanks very much.

Answer 1

You can use the R package textcat (https://CRAN.R-project.org/package=textcat). If your list of 100 sentences is called x, you only need to do the following:

library("textcat")
n3gram <- textcat_profile_db(x, n = 3)

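The result contains 100 elements (one per original sentence), each holding that sentence's 3-grams sorted by frequency. See ?textcat_profile_db for details and examples.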
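Note that textcat builds its profiles from character n-grams. If word-level 3-grams are what you are after, the following is a minimal base-R sketch of one way to get both the overall frequencies and the sentence each 3-gram occurs in. The toy list x, the helper word_trigrams, and the tokenizing regular expression are all assumptions made for illustration; they are not part of textcat or of the original question.

```r
# Toy stand-in for the 100-element list from the question (hypothetical data).
x <- list("the quick brown fox jumps",
          "the quick brown dog sleeps",
          "a quick brown fox")

# Hypothetical helper: split one sentence into lower-case word tokens
# (keeping #, @ and ' inside tokens) and return its word 3-grams.
word_trigrams <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^[:alnum:]#@']+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < 3) return(character(0))
  idx <- seq_len(length(words) - 2)
  paste(words[idx], words[idx + 1], words[idx + 2])
}

# One row per 3-gram occurrence, tagged with the index of its sentence.
per_sentence <- lapply(x, word_trigrams)
occurrences <- data.frame(
  sentence = rep(seq_along(per_sentence), lengths(per_sentence)),
  trigram  = as.character(unlist(per_sentence)),
  stringsAsFactors = FALSE
)

# Overall frequency of each 3-gram, most frequent first.
freq <- sort(table(occurrences$trigram), decreasing = TRUE)

# For each distinct 3-gram, the indices of the sentences it occurs in.
where <- split(occurrences$sentence, occurrences$trigram)
```

With this toy data, freq["the quick brown"] is 2 and where[["quick brown fox"]] gives sentences 1 and 3. On real tweets you would probably want to adjust the tokenizer, for example to decide how URLs and "RT" prefixes should be treated.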

Counting the frequency of n-grams in text using R
