How to calculate the proximity of words to a specific term in a document

Asked: 2017-05-18 20:57:22

Tags: r tm quanteda

I am trying to figure out a way to calculate word proximity to a specific term in a document, as well as the average proximity (by word). I know there are similar questions on SO, but nothing that gives me the answer I need, or even points me somewhere helpful. So let's say I have the following text:

song <- "Far over the misty mountains cold To dungeons deep and caverns old We 
must away ere break of day To seek the pale enchanted gold. The dwarves of 
yore made mighty spells, While hammers fell like ringing bells In places deep, 
where dark things sleep, In hollow halls beneath the fells. For ancient king 
and elvish lord There many a gleaming golden hoard They shaped and wrought, 
and light they caught To hide in gems on hilt of sword. On silver necklaces 
they strung The flowering stars, on crowns they hung The dragon-fire, in 
twisted wire They meshed the light of moon and sun. Far over the misty 
mountains cold To dungeons deep and caverns old We must away, ere break of 
day, To claim our long-forgotten gold. Goblets they carved there for 
themselves And harps of gold; where no man delves There lay they long, and 
many a song Was sung unheard by men or elves. The pines were roaring on the 
height, The winds were moaning in the night. The fire was red, it flaming 
spread; The trees like torches blazed with light. The bells were ringing in 
the dale And men they looked up with faces pale; The dragon’s ire more fierce 
than fire Laid low their towers and houses frail. The mountain smoked beneath 
the moon; The dwarves they heard the tramp of doom. They fled their hall to 
dying fall Beneath his feet, beneath the moon. Far over the misty mountains 
grim To dungeons deep and caverns dim We must away, ere break of day,
To win our harps and gold from him!"

I want to be able to see which words appear within 15 words (a number I'd like to be interchangeable) on either side of every occurrence of the word "fire" (also interchangeable). I want to see each word and the number of times it appears within that 15-word window for each instance of "fire". So, for example, "fire" is used 3 times. Of those 3 times, the word "light" falls within 15 words on either side twice. I want a table that shows the word, the number of times it appears within the specified proximity of 15, the maximum distance (in this case 12), the minimum distance (7), and the average distance (9.5).
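For "light", the desired row would look something like this (a mock-up built from the numbers above, with illustrative column names, not actual output):

word    number_within_15    max_distance    min_distance    average_distance
light                  2              12               7                 9.5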

I figure I'll need several steps and packages to make this work. My first thought was to use quanteda's kwic() function, since it lets you choose a "window" around a specific term. Then a frequency count of the terms in the kwic results isn't that hard (with stopwords removed for the frequency count, but not for the word proximity measure). My real problem is finding the maximum, minimum, and average distances from the focal term, and then getting the results into a nice tidy table with the terms in descending order by frequency and columns giving the frequency count, maximum distance, minimum distance, and average distance.

Here is what I have so far:

library(quanteda)
library(tm)

# lowercase the text
mysong <- char_tolower(song)

# tokenize, dropping hyphens, punctuation, numbers, and symbols
toks <- tokens(mysong, remove_hyphens = TRUE, remove_punct = TRUE, 
               remove_numbers = TRUE, remove_symbols = TRUE)

# keywords-in-context: a 15-word window around each occurrence of "fire"
mykwic <- kwic(toks, "fire", window = 15, valuetype = "fixed")
thekwic <- as.character(mykwic)

# clean the context strings with tm and drop stopwords
thekwic <- removePunctuation(thekwic)
thekwic <- removeNumbers(thekwic)
thekwic <- removeWords(thekwic, stopwords("en"))

# term frequencies within the windows
kwicFreq <- termFreq(thekwic)

Any help is greatly appreciated.

2 Answers:

Answer 0 (score: 3):

I'd suggest approaching this with the tidytext and fuzzyjoin packages.

You can start by tokenizing the text into a data frame with one row per word, adding a position column, and removing the stopwords:

library(tidytext)
library(dplyr)

all_words <- data_frame(text = song) %>%
  unnest_tokens(word, text) %>%              # one row per word
  mutate(position = row_number()) %>%        # position in the original text
  filter(!word %in% tm::stopwords("en"))     # drop stopwords (positions are kept)

You can then find the word fire and use difference_inner_join() from fuzzyjoin to find all rows within 15 words of those rows. After that, group_by() and summarize() give you the statistics you want for each word.

library(fuzzyjoin)

# rows for the focus term, joined to every word within 15 positions of it
nearby_words <- all_words %>%
  filter(word == "fire") %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

# per-word count plus max / min / mean distance, most frequent first
words_summarized <- nearby_words %>%
  group_by(word) %>%
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(desc(number))

In this case the output is:

# A tibble: 49 × 5
       word number maximum_distance minimum_distance average_distance
      <chr>  <int>            <dbl>            <dbl>            <dbl>
 1     fire      3                0                0              0.0
 2    light      2               12                7              9.5
 3     moon      2               13                9             11.0
 4    bells      1               14               14             14.0
 5  beneath      1               11               11             11.0
 6   blazed      1               10               10             10.0
 7   crowns      1                5                5              5.0
 8     dale      1               15               15             15.0
 9   dragon      1                1                1              1.0
10 dragon’s      1                5                5              5.0
# ... with 39 more rows

Note that this approach also lets you run the analysis on multiple focus words at once. All you'd have to do is change filter(word == "fire") to filter(word %in% c("fire", "otherword")) and change group_by(word) to group_by(focus_term, word) (the column created by the select() above), as in the sketch below.
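For instance, a minimal sketch of the multi-term variant, with "gold" as an arbitrary second focus term (everything else in the pipeline stays the same):

# same pipeline with two focus terms; "gold" is an arbitrary example
nearby_multi <- all_words %>%
  filter(word %in% c("fire", "gold")) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 15) %>%
  mutate(distance = abs(focus_position - position))

nearby_multi %>%
  group_by(focus_term, word) %>%   # one set of statistics per focus term
  summarize(number = n(),
            maximum_distance = max(distance),
            minimum_distance = min(distance),
            average_distance = mean(distance)) %>%
  arrange(focus_term, desc(number))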

Answer 1 (score: 2):

The tidytext answer is a good one, but there are tools in quanteda that can be adapted for this. The main function for counting within a window is not kwic() but rather fcm() (feature co-occurrence matrix).

require(quanteda)

# tokenize so that intra-word hyphens and punctuation are removed
toks <- tokens(song, remove_punct = TRUE, remove_hyphens = TRUE)

# all co-occurrences
head(fcm(toks, window = 15, context = "window", count = "frequency")[, "fire"])
## Feature co-occurrence matrix of: 155 by 1 feature.
## (showing first 6 documents and first feature)
##            features
## features    fire
##   Far          1
##   over         1
##   the          5
##   misty        1
##   mountains    0
##   cold         0

head(fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"])
## Feature co-occurrence matrix of: 1 by 1 feature.
## 1 x 1 sparse Matrix of class "fcm"
##         features
## features fire
##    light    2

Getting the average distance of the words from the target requires a bit of a hack of the weights function for distance. Below, weights are applied to weight the counts according to position, so that when summed they provide a weighted total; dividing by the total frequency within the window then gives the weighted mean distance. For your example of "light", for instance:

# average distance
fcm(toks, window = 15, context = "window", count = "weighted", weights = 1:15)["light", "fire"] /
    fcm(toks, window = 15, context = "window", count = "frequency")["light", "fire"]
## 1 x 1 Matrix of class "dgeMatrix"
##         features
## features fire
##    light  9.5

Getting the minimum and maximum position is a bit more complicated, although I can see a way to "hack" it using a combination of weights that place a binary mask at each position and then convert that into a distance. (It's too clunky to show neatly, so unless I think of a more elegant way, I'd recommend the tidy solution for those.)
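A rough sketch of that mask idea, assuming the toks object from above: a weight vector that is 1 at a single distance d and 0 elsewhere counts only the co-occurrences at exactly that distance, so scanning d across the window recovers the minimum and maximum.

# count co-occurrences of "light" with "fire" at each exact distance 1..15
dist_counts <- sapply(1:15, function(d) {
  w <- numeric(15)
  w[d] <- 1  # binary mask: weight 1 at distance d only
  fcm(toks, window = 15, context = "window",
      count = "weighted", weights = w)["light", "fire"]
})

min(which(dist_counts > 0))  # minimum distance (7 for "light")
max(which(dist_counts > 0))  # maximum distance (12 for "light")

This recomputes the fcm once per candidate distance, which is exactly the clunkiness mentioned above, but it agrees with the weighted-mean trick: (7 + 12) / 2 gives the 9.5 shown earlier.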