Extract words on the left and right of a keyword

Time: 2016-07-08 15:39:57

Tags: r

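I'm getting deeper into text mining, and a client recently asked whether it is possible to get the 5 words preceding and following a key term. For example...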

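To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.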

Key term=twisters
Preceding 5 words=the full effect of tongue
Following 5 words=you should repeat them several

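The long-term plan is to take the 10 most frequent terms, together with the words preceding and following each of them, and load everything into a data.frame. I took a stab at it with gsub but got nowhere.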

Any ideas, guidance, etc. would be much appreciated.

3 answers:

Answer 0: (score: 3)

You can use the word function from stringr:

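library(stringr)
# x is assumed to be the example sentence from the question
x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
# position of the key term among the space-separated words
ind <- sapply(strsplit(x, ' '), function(i) which(i == 'twisters'))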
word(x, ind-5, ind-1)
#[1] "the full effect of tongue"
word(x, ind+1, ind+5)
#[1] "you should repeat them several"

Answer 1: (score: 2)

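The quanteda package has a function dedicated to returning keywords in context: kwic. It uses stringi under the hood. In more recent quanteda releases the argument is called pattern rather than keywords, so the call below may need adjusting.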

library(quanteda)
kwic(txt, keywords = "twisters", window = 5, case_insensitive = TRUE)
#                            contextPre  keyword                      contextPost
#[text1, 8] the full effect of tongue [ twisters ] you should repeat them several
#[text2, 2]                       The [ twisters ] are always twisting           
#[text3, 9]  for those guys, they are [ twisters ] of words and will tell
#[text4, 1]                           [ Twisters ] will ruin your life. 

Sample text:

# sample text
txt <- c("To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.",
         "The twisters are always twisting",
         "watch out for those guys, they are twisters of words and will tell a yarn a mile long",
         "Twisters will ruin your life.")
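The question also mentions starting from the 10 most frequent terms. That step is not covered by the answer above. A minimal base R sketch, assuming simple lowercasing and punctuation stripping are acceptable and that no stop-word removal is wanted, could look like this (the names words, freq and top_terms are illustrative only):

```r
# Sample text, repeated here so the block runs on its own
txt <- c("To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing.",
         "The twisters are always twisting",
         "watch out for those guys, they are twisters of words and will tell a yarn a mile long",
         "Twisters will ruin your life.")

# Lowercase, drop punctuation, then split every text on whitespace
words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(txt)), "\\s+"))
words <- words[nzchar(words)]  # guard against empty strings

# Count each word and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)

# Keep at most the 10 most frequent terms
top_terms <- names(freq)[seq_len(min(10, length(freq)))]
top_terms[1]
```

Without a stop-word filter, common words such as "the" or "are" will rank highly, so in practice the list would probably need pruning before being used as key terms.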

Answer 2: (score: 0)

Use strsplit to split the string into a vector of words, then use grep to get the right index. If you do this often, you should wrap it in a function.

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
x_split <- strsplit(x, " ")[[1]]
key <- "twisters"
# search the split words, not the original string, so the index is a word position
key_index <- grep(key, x_split)
before <- x_split[(key_index - 5):(key_index - 1)]
after <- x_split[(key_index + 1):(key_index + 5)]
before
#[1] "the"    "full"   "effect" "of"     "tongue"
after
#[1] "you"     "should"  "repeat"  "them"    "several"
paste(before, collapse = " ")
#[1] "the full effect of tongue"
paste(after, collapse = " ")
#[1] "you should repeat them several"
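As a rough sketch of the "wrap it in a function" suggestion, the helper below (context_words is a made-up name, not something from the answer) returns a data.frame with one row per match. It assumes words are separated by single spaces and matches the key term exactly, so "Twisters" or "twisters," would not be found. Unlike the code above, it uses head and tail so that a key term fewer than n words from either end simply yields a shorter context instead of invalid indices.

```r
# Hypothetical helper: n words before and after each exact match of `key`
context_words <- function(x, key, n = 5) {
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
  hits <- which(tokens == key)  # exact, case-sensitive matches only

  # No match: return an empty data.frame with the same columns
  if (length(hits) == 0) {
    return(data.frame(keyword = character(0), before = character(0),
                      after = character(0), stringsAsFactors = FALSE))
  }

  # Last n words before each hit (fewer if the hit is near the start)
  before <- vapply(hits, function(i) {
    paste(tail(tokens[seq_len(i - 1)], n), collapse = " ")
  }, character(1))

  # First n words after each hit (fewer if the hit is near the end)
  after <- vapply(hits, function(i) {
    paste(head(tokens[seq_len(length(tokens) - i) + i], n), collapse = " ")
  }, character(1))

  data.frame(keyword = key, before = before, after = after,
             stringsAsFactors = FALSE)
}

x <- "To get the full effect of tongue twisters you should repeat them several times, as quickly as possible, without stumbling or mispronouncing."
res <- context_words(x, "twisters")
res
```

For several texts or several key terms, the per-call results could be combined with do.call(rbind, ...) into the single data.frame the question asks for.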