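What I need is a function that finds words within a certain "word distance" of each other. Take the sentence "He had a bag of tools in his car", where the words of interest are "bag" and "tools".
Using the quanteda kwic() function, I can find "bag" and "tools" individually, but this often gives me far too many results. I need "bag" and "tools" within five words of each other.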
Answer 0 (score: 0)
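You can use the fcm() function to count co-occurrences within a fixed window (for example, 5 words). This creates a "feature co-occurrence matrix", which can be defined for a token span of any size or for the context of an entire document.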
For your example, or at least an example based on my interpretation of your question, this would look like:
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other"
)
fcm(txt, context = "window", window = 5)
## Feature co-occurrence matrix of: 10 by 10 features.
## 10 x 10 sparse Matrix of class "fcm"
##         features
## features He had a bag of tools in his car other
## He        0   1 1   1  1     1  0   0   0     0
## had       0   0 1   1  1     1  1   0   0     0
## a         0   0 0   1  1     1  1   1   0     0
## bag       0   0 0   0  1     2  1   1   1     4
## of        0   0 0   0  0     1  1   1   1     0
## tools     0   0 0   0  0     0  1   1   1     5
## in        0   0 0   0  0     0  0   1   1     0
## his       0   0 0   0  0     0  0   0   1     0
## car       0   0 0   0  0     0  0   0   0     0
## other     0   0 0   0  0     0  0   0   0    10
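Here, the bag/tools cell is 2: the term bag occurs within 5 tokens of tools once in each document. In the first document the two words are 2 tokens apart. In the second document they are exactly 5 tokens apart, which, judging from the output above, still falls inside the window. Had they been more than 5 tokens apart, that pair would not have been counted.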
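Because the matrix above pools counts across documents, it does not show which documents contain the pair. A minimal base R sketch of a per-document check follows. The helper within_window() and the extra document d3 are hypothetical and not part of quanteda; the helper assumes naive whitespace tokenisation and an inclusive window, mirroring the output above.

```r
# Hypothetical helper (not part of quanteda): TRUE if `a` and `b` occur
# within `window` tokens of each other in a single text.
within_window <- function(text, a, b, window = 5) {
  # Naive whitespace tokenisation, lower-cased; punctuation is not handled.
  toks <- tolower(strsplit(text, "\\s+")[[1]])
  pos_a <- which(toks == a)
  pos_b <- which(toks == b)
  if (length(pos_a) == 0 || length(pos_b) == 0) return(FALSE)
  # Smallest absolute distance between any occurrence of a and any of b.
  min(abs(outer(pos_a, pos_b, "-"))) <= window
}

txt <- c(
  d1 = "He had a bag of tools in his car",
  d2 = "bag other other other other tools other",
  d3 = "bag other other other other other tools"  # added: 6 tokens apart
)

# One logical value per document, named by document.
hits <- vapply(txt, within_window, logical(1), a = "bag", b = "tools")
print(hits)
```

Under these assumptions, d1 and d2 are flagged and d3 is not, since its two words are 6 tokens apart. For real text, a proper tokeniser would be a better choice than splitting on whitespace.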