Question: I am doing text mining with a tokenizer and want to limit the length of the strings in the input data. The code below keeps the whole string whenever it contains the word.
# create a data frame with the data
library(dplyr)  # needed for filter()
dd <- data.frame(
  text = c("hello how are you doing thank
you for helping me with this
problem", "junk", "junk"), stringsAsFactors = FALSE)

# keep strings that include the term "how"
dd <- filter(dd, grepl('how', text))
Question: How can I modify the code so that it keeps only the N words after the keyword?
e.g.
If N = 1, then dd would include: how are
If N = 2, then dd would include: how are you
If N = 3, then dd would include: how are you doing
...
The code also needs to work if I add other words to keep:
# keep strings that include the terms "how" and "with"
dd <- filter(dd, grepl('how|with', text))
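For comparison, here is a minimal base-R sketch of what the question asks for: extract the keyword plus the N words that follow it with a regular expression. The pattern and variable names are my own illustration, not part of the answer below.

```r
# Base-R sketch (illustrative, not from the answer): keep the keyword plus
# up to N following words using a regular expression.
dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE)

N <- 2
keywords <- c("how", "with")
# \b(keyword1|keyword2)(\s+\S+){0,N} matches a keyword and up to N words after it
pattern <- sprintf("\\b(%s)(\\s+\\S+){0,%d}",
                   paste(keywords, collapse = "|"), N)
m <- regmatches(dd$text, regexpr(pattern, dd$text, perl = TRUE))
m
# "how are you"  (rows without a keyword produce no match)
```

Note that `regexpr` returns only the first match per string; for every keyword occurrence you would switch to `gregexpr`.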
Answer 0 (score: 1):
Here is a possible approach with the tidytext mining package (so check the dependencies):
library(tidytext) # install.packages("tidytext")
library(tidyr) # install.packages("tidyr")
library(dplyr) # install.packages("dplyr")
dd <- data.frame(
text = c("hello how are you doing thank
you for helping me with this
problem","junk","junk"), stringsAsFactors = F)
I introduce a `scope` parameter for the word horizon; it is easy to turn the following code into a function:
scope <- 2
dd %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
  separate(ngram, paste("word", 1:(scope + 1), sep = ""), sep = " ") %>%
  filter(word1 %in% c("how", "me"))
# A tibble: 2 × 3
word1 word2 word3
<chr> <chr> <chr>
1 how are you
2 me with this
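As noted above, this pipeline is easy to wrap in a function. A sketch, where the function name `keep_n_after` and its default are my own choices, not from the answer:

```r
library(tidytext)  # install.packages("tidytext")
library(tidyr)     # install.packages("tidyr")
library(dplyr)     # install.packages("dplyr")

# Keep rows whose first n-gram word is one of the keywords,
# along with the `scope` words that follow it.
keep_n_after <- function(df, keywords, scope = 2) {
  df %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
    separate(ngram, paste0("word", 1:(scope + 1)), sep = " ") %>%
    filter(word1 %in% keywords)
}

dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE)

keep_n_after(dd, c("how", "me"), scope = 2)
# same two rows as the output above: "how are you" and "me with this"
```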
If you want to end up with strings, you have to collapse the ngrams back together, as in this second example:
scope <- 3
dd %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
  separate(ngram, paste("word", 1:(scope + 1), sep = ""), sep = " ") %>%
  filter(word1 %in% c("how")) %>%
  apply(., 1, paste, collapse = " ")
[1] "how are you doing"
Regarding your comment: if you now want to do this processing per chunk (per string), you have to make that grouping explicit. Here is one way:
scope <- 2
subsets <- dd %>%
  mutate(id = 1:length(text)) %>%
  split(., .$id)

unlist(lapply(subsets, function(dd) {
  dd %>%
    unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
    separate(ngram, paste("word", 1:(scope + 1), sep = ""), sep = " ") %>%
    filter(word1 %in% c("how", "problem")) %>%
    apply(., 1, FUN = function(vec) paste(vec[-1], collapse = " "))
}))
1
"how are you"
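The same per-string grouping can also be expressed without `split()`/`lapply()`, by carrying an `id` column through the pipeline and uniting the word columns back into one string at the end. This alternative sketch (my own variation, with `scope` fixed at 2) gives the same result:

```r
library(tidytext)  # install.packages("tidytext")
library(tidyr)     # install.packages("tidyr")
library(dplyr)     # install.packages("dplyr")

dd <- data.frame(
  text = c("hello how are you doing thank you for helping me with this problem",
           "junk", "junk"),
  stringsAsFactors = FALSE)

scope <- 2
out <- dd %>%
  mutate(id = row_number()) %>%                                     # remember which string each ngram came from
  unnest_tokens(ngram, text, token = "ngrams", n = 1 + scope) %>%
  separate(ngram, paste0("word", 1:(scope + 1)), sep = " ") %>%
  filter(word1 %in% c("how", "problem")) %>%
  unite(phrase, word1, word2, word3, sep = " ")                     # collapse back to one string per match
out
# one row: id = 1, phrase = "how are you"
```

("problem" is the last word of the string, so no trigram starts with it and only the "how" match survives, matching the output above.)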