I have a simple question. Please consider this example:
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame()
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well'))
# A tibble: 2 x 1
mytext
<chr>
1 stackoverflow is pretty good my friend
2 but sometimes pretty bad as well
I want to count the number of times stackoverflow appears near good. I am using the following regex, but it does not work.
dataframe %>% mutate(mycount = str_count(mytext,
regex('stackoverflow(?:\\w+){0,5}good', ignore_case = TRUE)))
# A tibble: 2 x 2
  mytext                                 mycount
  <chr>                                    <int>
1 stackoverflow is pretty good my friend       0
2 but sometimes pretty bad as well             0
Can anyone tell me what I am missing here?
Thanks!
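Answer 0: (score: 1)
I had some trouble with this too, and I am still not sure why none of the things I tried worked. I am only decent at regex, not an expert. However, I was able to get it working with a lookbehind and a lookahead.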
library(dplyr)
library(stringr)
# tibble() replaces the deprecated data_frame()
dataframe <- tibble(mytext = c('stackoverflow is pretty good my friend',
                               'but sometimes pretty bad as well',
                               'stackoverflow one two three four five six good',
                               'stackoverflow good'))
dataframe
dataframe %>% mutate(mycount = str_count(mytext,
regex('(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)', ignore_case = TRUE)))
## A tibble: 4 x 2
# mytext mycount
# <chr> <int>
#1 stackoverflow is pretty good my friend 1
#2 but sometimes pretty bad as well 0
#3 stackoverflow one two three four five six good 0
#4 stackoverflow good 1
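To see what the lookaround pattern above actually consumes, here is a minimal sketch in base R (using `gregexpr()` with `perl = TRUE` rather than stringr, only to keep it dependency-free). The variable names `texts`, `hits`, and `gaps` are illustrative and not part of the original answer. Because the lookbehind and lookahead are zero-width, the match should contain only the text between the two keywords.

```r
# Sample strings taken from the answer above.
texts <- c("stackoverflow is pretty good my friend",
           "stackoverflow one two three four five six good",
           "stackoverflow good")

# Same pattern as in the answer: the keywords sit inside lookarounds,
# so they are checked but not included in the match.
pattern <- "(?<=stackoverflow)\\s(?:\\w+\\s){0,5}(?=good)"

hits <- regmatches(texts,
                   gregexpr(pattern, texts, perl = TRUE, ignore.case = TRUE))

# First matched gap per string, or NA when nothing matched.
gaps <- vapply(hits,
               function(h) if (length(h) > 0) h[1] else NA_character_,
               character(1))

# Number of matches per string, comparable to str_count().
counts <- lengths(hits)

print(gaps)
print(counts)
```

The second string has six words between the keywords, one more than `{0,5}` allows, so it yields no match.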
Answer 1: (score: 1)
The corpus library makes this easy:
library(corpus)
dataframe <- data.frame(mytext = c('stackoverflow is pretty good my friend',
'but sometimes pretty bad as well'))
# find instances of 'stackoverflow'
loc <- text_locate(dataframe$mytext, "stackoverflow")
# count the number of times 'good' is within 5 tokens
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
# aggregate over text
count <- tapply(near_good, loc$text, sum, default = 0)
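Conceptually, corpus treats text as a sequence of tokens. The library lets you index these sequences with text_sub(). You can also change what counts as a token with text_filter().
Here is an example that works the same way but ignores punctuation: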
corpus <- corpus_frame(text = c("Stackoverflow, is pretty (?) GOOD my friend!",
"But sometimes pretty bad as well"))
text_filter(corpus)$drop_punct <- TRUE
loc <- text_locate(corpus, "stackoverflow")
near_good <- (text_detect(text_sub(loc$before, -4, -1), "good")
| text_detect(text_sub(loc$after, 1, 4), "good"))
count <- tapply(near_good, loc$text, sum, default = 0)
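The token-window idea in this answer can also be sketched in base R, without the corpus package. This is a simplified illustration under stated assumptions, not a reimplementation of corpus: the helper name `count_near` is hypothetical, tokens are taken to be runs of alphanumeric characters, and "near" means the two token positions differ by at most `window`.

```r
# Hypothetical helper: count occurrences of `target` that have `neighbor`
# within `window` token positions on either side.
count_near <- function(text, target, neighbor, window = 5) {
  # Assumption: split on anything that is not a letter or digit,
  # which also discards punctuation.
  tokens <- tolower(unlist(strsplit(text, "[^[:alnum:]]+")))
  tokens <- tokens[nzchar(tokens)]

  target_pos <- which(tokens == target)
  neighbor_pos <- which(tokens == neighbor)
  if (length(target_pos) == 0 || length(neighbor_pos) == 0) {
    return(0L)
  }

  # For each target occurrence, check whether any neighbor is close enough.
  sum(vapply(target_pos,
             function(p) any(abs(neighbor_pos - p) <= window),
             logical(1)))
}

texts <- c("Stackoverflow, is pretty (?) GOOD my friend!",
           "But sometimes pretty bad as well",
           "good stuff on stackoverflow")

counts <- vapply(texts, count_near, integer(1),
                 target = "stackoverflow", neighbor = "good",
                 USE.NAMES = FALSE)
print(counts)
```

Unlike the one-directional regex approaches, this sketch also counts cases where good comes before stackoverflow, as in the third string.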
Answer 2: (score: 0)
I think I figured it out:
dataframe %>%
mutate(mycount = str_count(mytext,
regex('stackoverflow\\W+(?:\\w+ ){0,5}good', ignore_case = TRUE)))
# A tibble: 4 x 2
  mytext                                 mycount
  <chr>                                    <int>
1 stackoverflow is pretty good my friend       1
2 but sometimes pretty bad as well             0
3 stackoverflow good good stackoverflow        1
4 stackoverflowgood                            0
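The key is adding \W+, which matches one or more non-word characters (such as whitespace or punctuation) between the words. In the original pattern, (?:\w+){0,5} can only match word characters, so it can never get past the space after stackoverflow.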
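The difference between the two patterns can be checked side by side. The following is a minimal sketch in base R (again using `gregexpr()` with `perl = TRUE` instead of stringr, only to avoid dependencies); the helper `count_matches` is a hypothetical name introduced for illustration.

```r
texts <- c("stackoverflow is pretty good my friend",
           "stackoverflow good good stackoverflow",
           "stackoverflowgood")

# Hypothetical helper: number of non-overlapping matches per string.
# gregexpr() returns -1 when there is no match, hence the `> 0` check.
count_matches <- function(pattern, x) {
  m <- gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE)
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

# Pattern from the question: no way to consume the whitespace between words.
original <- count_matches("stackoverflow(?:\\w+){0,5}good", texts)

# Pattern from this answer: \W+ consumes the separator after stackoverflow.
fixed <- count_matches("stackoverflow\\W+(?:\\w+ ){0,5}good", texts)

print(data.frame(text = texts, original = original, fixed = fixed))
```

The question's pattern matches only when the two words are glued together, while the corrected pattern requires at least one separator and so skips that case.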