R text mining - how to identify words preceding a keyword

Time: 2017-04-17 00:52:14

Tags: r text count frequency text-mining

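I am working on text mining using R, and I would like to identify whether certain words precede my focal keyword by three or fewer words. For instance, my focal keyword is "compatibility", and I want to know whether the word "limited" precedes it by three or fewer words. Thus, I want a frequency count of how many times the following combinations appear in a text (X = any other word):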

  • limited compatibility
  • limited X compatibility
  • limited X X compatibility

Any suggestions are welcome. Thanks.

2 answers:

Answer 0: (score: 0)

Here is an approach that uses tidytext to find skip n-grams:

library(tidyverse)
library(tidytext)

x <- 'I am working on text mining using R, I would like to identify if some words precede my focal keyword by three or fewer words. For instance, my focal keyword is compatibility and I wanted to know if the word limited precedes my keyword by three or fewer words. Thus, I wanted to get frequency count in a text regarding how many times the following combination appears (X=any other word):

limited compatibility
limited X compatibility
limited X X compatibility

Any suggestions are welcome. Thanks.'

# tibble() replaces the deprecated data_frame()
tibble(x = x) %>%
    unnest_tokens(line, x, 'lines') %>% 
    mutate(line_number = row_number()) %>%
    unnest_tokens(ngram, line, 'skip_ngrams', n = 2, k = 2) %>% 
    filter(grepl('limited', ngram), grepl('compatibility', ngram)) 
#> # A tibble: 3 × 2
#>   line_number                 ngram
#>         <int>                 <chr>
#> 1           2 limited compatibility
#> 2           3 limited compatibility
#> 3           4 limited compatibility

Answer 1: (score: 0)

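Here is an approach using base R and regular expressions. With the argument all = TRUE, grepRaw returns the position of every match of the regex pattern, so the length of its result gives the number of matches.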

d <- c("
Limited compatibility Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla maximus lobortis 
tellus quis egestas. Donec non dignissim urna. Praesent at commodo ligula. 
Cras laoreet limited compatibility interdum mi nec euismod. Ut interdum odio non sem luctus iaculis. Mauris id sapien limited X XXXX compatibility accumsan, imperdiet justo non,limited compatibility egestas felis. Morbi commodo lectus limited X compatibility scelerisque limited XXX compatibility est bibendum, vel varius tellus vulputate. Aenean dictum accumsan limited X compatibility neque limited X X compatibility sed dictum. Vivamus finibus lacus sit amet iaculis molestie. Fusce enim limited X compatibility sapien, iaculis quis leo non, pellentesque lobortis arcu. Proin commodo limited X XXX XXXXX compatibility velit placerat venenatis mattis. Limited compatibility Curabitur et laoreet ipsum. Limited compatibility
")

> length(grepRaw("Limited compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 5
> length(grepRaw("limited \\w+ compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 4
> length(grepRaw("limited (\\w+ ){2}compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 2
> length(grepRaw("limited (\\w+ ){3}compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 1
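The separate counts above can also be collected in one pass with a range quantifier. The following is a minimal sketch, not part of the original answer: the helper name count_within, its arguments, and the sample text are hypothetical, and it assumes both search terms are plain words with no regex metacharacters.

```r
# Hypothetical helper: count how often `first` precedes `second`
# with 0..max_gap intervening words, broken down by gap size.
count_within <- function(text, first, second, max_gap = 2L) {
  # {0,max_gap} allows zero up to max_gap intervening words;
  # \\b keeps the terms from matching inside longer words.
  pattern <- sprintf("\\b%s (\\w+ ){0,%d}%s\\b",
                     first, as.integer(max_gap), second)
  m <- gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  hits <- regmatches(text, m)[[1]]
  # Number of intervening words = words in the match minus the two terms
  gaps <- lengths(strsplit(hits, " ", fixed = TRUE)) - 2L
  table(factor(gaps, levels = 0:max_gap))
}

# Made-up sample text (an assumption for illustration only)
txt <- paste("limited compatibility, limited X compatibility,",
             "limited X X compatibility, limited X X X X compatibility")

res <- count_within(txt, "limited", "compatibility", max_gap = 2L)
print(res)
# One match each for gaps 0, 1 and 2; the four-word gap is not counted.
```

Because gregexpr returns non-overlapping matches, a second "limited" that falls inside an already matched span would not be counted separately.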

The following regex matches "limited X compatibility neque limited X X compatibility", which is not the intended behavior:

> length(grepRaw("limited (\\w+ ){6}compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 1

It may therefore be safer to put each "limited xx compatibility" pattern on its own line:

d <- gsub("Limited", "\nLimited", d, ignore.case = TRUE)
d <- gsub("compatibility", "compatibility\n", d, ignore.case = TRUE)
# writeLines(d)

Now the result is correct:

> length(grepRaw("limited (\\w+ ){6}compatibility", d, ignore.case = TRUE, all = TRUE))
[1] 0
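A different way to avoid matches that run across a second "limited" is to work with word positions instead of one long regex. This is only a sketch under stated assumptions, not the answerer's method: the helper gap_counts and the sample sentence are hypothetical, and splitting on non-alphanumeric characters means punctuation and sentence boundaries are ignored.

```r
# Hypothetical helper: for every occurrence of `second`, look up the nearest
# preceding occurrence of `first` and record how many words lie between them.
gap_counts <- function(text, first, second, max_gap = 2L) {
  # Lower-case word tokens; punctuation is treated as a separator
  words <- tolower(unlist(strsplit(text, "[^[:alnum:]]+")))
  words <- words[nzchar(words)]
  first_pos  <- which(words == tolower(first))
  second_pos <- which(words == tolower(second))
  gaps <- vapply(second_pos, function(p) {
    before <- first_pos[first_pos < p]
    if (length(before) == 0L) NA_integer_
    else as.integer(p - max(before) - 1L)
  }, integer(1))
  # Keep only pairs that are close enough
  gaps <- gaps[!is.na(gaps) & gaps <= max_gap]
  table(factor(gaps, levels = 0:max_gap))
}

# Made-up sample sentence (an assumption for illustration only)
s <- paste("Limited X compatibility neque limited X X compatibility",
           "and limited a b c d compatibility")
pos_res <- gap_counts(s, "limited", "compatibility", max_gap = 2L)
print(pos_res)
# Gap 1 and gap 2 occur once each; the four-word gap is dropped.
```

Each "compatibility" is paired only with the closest "limited" before it, so a long span like the six-word match above is never counted.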