Question

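I am using the tm library for text mining in R. I know how to extract keywords, but I would like to extract associations of words that frequently appear "together" in a document (for example, to pick up the expressions "hydraulic jack" or "proof of concept").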

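I know there is the findAssocs function, but it only seems relevant when you want the words associated with one specific word. I want to automatically detect "concatenations of words that are linked together".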

Is there a way to do this with the tm library, or in any other way in R?

Thanks in advance.

Edit: using the quanteda package, and the fcm function in particular, I get an error. The function says it needs a data frame, but DF already is one...

Answer 1

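The term you are looking for is co-occurrence.

I know of two packages that can help you.

Package quanteda: the fcm function creates a sparse feature co-occurrence matrix.
Package udpipe: the cooccurrence function creates a co-occurrence data.frame that indicates how many times each term co-occurs with another term.

Pick whichever one fits your needs.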
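As a rough illustration of the quanteda route, the sketch below counts how often two words appear directly next to each other and ranks the pairs. The three sample sentences and the small hand-written stopword list are made up for this example, and using a window of 1 (adjacent tokens only) is an assumption about what "linked together" should mean; widen the window if looser associations are wanted.

```r
library(quanteda)

# Hypothetical sample documents, invented for illustration only
txt <- c(d1 = "the hydraulic jack lifted the car",
         d2 = "a hydraulic jack is a proof of concept",
         d3 = "this proof of concept uses a hydraulic jack")

# Tokenize and drop a few function words (hand-picked list, an assumption)
toks <- tokens(txt)
toks <- tokens_remove(toks, pattern = c("the", "a", "is", "of", "this"))

# Co-occurrence of tokens that are direct neighbours (window = 1)
co <- fcm(toks, context = "window", window = 1, tri = FALSE)

# Flatten the matrix into a ranked list of word pairs,
# reading each pair once from the upper triangle
m <- as.matrix(co)
idx <- which(m > 0 & upper.tri(m), arr.ind = TRUE)
pairs <- data.frame(term1 = rownames(m)[idx[, "row"]],
                    term2 = colnames(m)[idx[, "col"]],
                    count = m[idx],
                    stringsAsFactors = FALSE)
pairs <- pairs[order(-pairs$count), ]
head(pairs)
```

Pairs with high counts are candidates for multi-word expressions; on real data a minimum count threshold would probably be needed to filter out chance neighbours.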

Edit, in response to the OP's edit:

Your DF is not a dfm object. It looks like a data.frame. tidytext has a function, cast_dfm, that converts a data.frame into a dfm for use in quanteda.

library(quanteda)
DF <- data.frame(term = c("anthony", "choonheyt", "construction", "direction"),
                 document = c(1,1,2,2),
                 count = c(1,1,1,1), stringsAsFactors = FALSE)


# cast as dfm from tidytext
x <- tidytext::cast_dfm(DF, document, term, count)
x
Document-feature matrix of: 2 documents, 4 features (50% sparse).
2 x 4 sparse Matrix of class "dfm"
    features
docs anthony choonheyt construction direction
   1       1         1            0         0
   2       0         0            1         1

fcm(x, context = "document", count = "frequency")

Feature co-occurrence matrix of: 4 by 4 features.
4 x 4 sparse Matrix of class "fcm"
              features
features       anthony choonheyt construction direction
  anthony            0         1            0         0
  choonheyt          0         0            0         0
  construction       0         0            0         1
  direction          0         0            0         0

Answer 2

For the udpipe R package, there is a vignette that covers this: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html
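For a quick look without running the full annotation pipeline from the vignette, a minimal sketch could call udpipe's cooccurrence function directly on a vector of words. The sample sentence is invented for this example, and splitting on single spaces is a simplifying assumption in place of proper tokenization. With skipgram = 0, only words that directly follow each other are counted.

```r
library(udpipe)

# Hypothetical sample text, split on spaces (a simplification, not real tokenization)
words <- strsplit("the hydraulic jack lifted the car and the hydraulic jack held",
                  " ", fixed = TRUE)[[1]]

# Count how often a word is directly followed by another word
pairs <- cooccurrence(words, skipgram = 0)

# Result is a data.frame with columns term1, term2 and cooc
head(pairs)
```

In practice the vignette's approach of annotating the text first is likely the better choice, since it allows restricting the pairs to nouns and adjectives.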

Could that be the key concept?
