Find a word near another word using stringr

Time: 2017-10-25 14:13:18

Tags: r dplyr stringr

I have a simple question. Consider this example:

library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well'))

# A tibble: 2 x 1
                                  mytext
                                   <chr>
1 stackoverflow is pretty good my friend
2       but sometimes pretty bad as well

I want to count the number of times stackoverflow appears near good. I am using the following regular expression, but it does not work.

dataframe %>%  mutate(mycount = str_count(mytext, 
 regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
                                  mytext mycount
                                   <chr>   <int>
1 stackoverflow is pretty good my friend       0
2       but sometimes pretty bad as well       0

Can anyone tell me what I am missing here?

Thanks!

3 answers:

Answer 0 (score: 1)

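I had some trouble with this too, and I am still not sure why the things I tried did not work. I am only decent with regular expressions, not an expert. However, I was able to get it working with a lookbehind and a lookahead.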

library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame(); dplyr re-exports it
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well',
                               'stackoverflow one two three four five six good',
                               'stackoverflow good'))

dataframe
dataframe %>%  mutate(mycount = str_count(mytext, 
      regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
## A tibble: 4 x 2
#                                          mytext mycount
#                                           <chr>   <int>
#1         stackoverflow is pretty good my friend       1
#2               but sometimes pretty bad as well       0
#3 stackoverflow one two three four five six good       0
#4                             stackoverflow good       1
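
As a possible explanation for why the pattern in the question returns 0, here is a minimal sketch. It assumes the cause is that `\w+` never matches the spaces between words, so the engine cannot get from "stackoverflow" to "good". The second pattern is only an illustrative variant (not the asker's or this answer's exact regex) that allows whitespace before each intervening word.

```r
library(stringr)

x <- c("stackoverflow is pretty good my friend",
       "but sometimes pretty bad as well",
       "stackoverflow one two three four five six good",
       "stackoverflow good")

# Pattern from the question: \w+ cannot consume the space after
# "stackoverflow", so "good" is never reached.
original_count <- str_count(
  x, regex("stackoverflow(?:\\w+){0,5}good", ignore_case = TRUE))

# Illustrative variant: each intervening word is preceded by whitespace,
# and at most five such words may sit between the two terms.
fixed_count <- str_count(
  x, regex("stackoverflow(?:\\s+\\w+){0,5}\\s+good", ignore_case = TRUE))

original_count  # 0 0 0 0
fixed_count     # 1 0 0 1
```

The third string still counts as 0 because six words separate the two terms, which exceeds the `{0,5}` limit.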

Answer 1 (score: 1)

The corpus package makes this simple:

library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
                                   'but sometimes pretty bad as well'))

# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")

# count the number of times 'good' is within 5 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
              | text_detect(text_sub(loc$after, 1, 4), "good"))

# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)

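Conceptually, corpus treats a text as a sequence of tokens. The package lets you index these sequences with the text_sub() function. You can also change the definition of a token using text_filter().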

Here is an example that works the same way but ignores punctuation:

corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
                                "But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE

loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
              | text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
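
The token-window idea in this answer can also be sketched with stringr alone. This is a rough illustration, not the corpus package's implementation: `count_near` is a hypothetical helper, tokens are taken to be runs of word characters, and "near" is assumed to mean within `window` tokens on either side.

```r
library(stringr)

# Hypothetical helper: count occurrences of `target` that have `neighbor`
# within `window` tokens before or after them.
count_near <- function(text, target, neighbor, window = 5) {
  # Lowercase the text and keep only word tokens, which drops punctuation.
  tokens <- str_extract_all(str_to_lower(text), "\\w+")[[1]]
  target_pos <- which(tokens == target)
  neighbor_pos <- which(tokens == neighbor)
  # For each target position, check whether any neighbor is close enough.
  sum(vapply(target_pos,
             function(p) any(abs(neighbor_pos - p) <= window),
             logical(1)))
}

mytext <- c("Stackoverflow, is pretty (?) GOOD my friend!",
            "But sometimes pretty bad as well")

counts <- vapply(mytext, count_near, integer(1),
                 target = "stackoverflow", neighbor = "good",
                 USE.NAMES = FALSE)
counts  # 1 0
```

Working on token positions rather than on one regular expression makes the window symmetric, so "good" is found whether it comes before or after "stackoverflow".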

Answer 2 (score: 0)

I think I figured it out:

dataframe %>%
  mutate(mycount = str_count(mytext,
    regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))

# A tibble: 4 x 2
                                  mytext mycount
                                   <chr>   <int>
1 stackoverflow is pretty good my friend       1
2       but sometimes pretty bad as well       0
3  stackoverflow good good stackoverflow       1
4                      stackoverflowgood       0

The key is adding the \W+ metacharacter, which matches any non-word characters between the words.
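
To see what `\W+` contributes, the following sketch runs this answer's pattern on a few made-up strings. The second pattern is an assumed variant, not part of the answer, that also uses `\W+` between the intervening words so punctuation there is tolerated as well.

```r
library(stringr)

x <- c("stackoverflow, is pretty good",   # comma and space after the first word
       "stackoverflowgood",               # no separator at all
       "stackoverflow is pretty, good")   # comma between intervening words

# Pattern from this answer: \W+ after "stackoverflow", then up to five
# words that are each followed by exactly one space.
strict <- str_count(
  x, regex("stackoverflow\\W+(?:\\w+ ){0,5}good", ignore_case = TRUE))

# Assumed variant: allow any run of non-word characters after each word.
relaxed <- str_count(
  x, regex("stackoverflow\\W+(?:\\w+\\W+){0,5}good", ignore_case = TRUE))

strict   # 1 0 0
relaxed  # 1 0 1
```

Both patterns require at least one non-word character after "stackoverflow", which is why "stackoverflowgood" is never counted.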