I need to extract the n words that appear before and after a term, for a text analysis I am working on. Here is a reproducible example:
a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
"The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
"The day was nice and dry, when she came, we were not here. Our game was not completed timely, but it was completed after one hour.")
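Below is the call I am using, but it does not work when there is punctuation around the words or when there are double spaces. In those cases the pattern fails to match and gsub returns the input string unchanged.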
gsub(".*(( \\w{1,}){3} game( \\w{1,}){3}).*", "\\1", a, perl = TRUE)
[1] " came for our game we were ready"
[2] "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes."
[3] "The day was nice and dry, when she came, we were not here. Our game was not completed timely, but it was completed after one hour."
Below is the desired output:
[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game was not completed"
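Answer 0 (score: 2)
Try \\W{1,} instead of a literal space, so that any run of non-word characters (punctuation, multiple spaces) can separate the words: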
gsub(".*(((\\W{1,})\\w{1,}){3} game((\\W{1,})\\w{1,}){3}).*", "\\1", a, perl = TRUE)
[1] " came for our game we were ready"
[2] " came for our game, but we were"
[3] " not here. Our game was not completed"
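Since the question asks for n words rather than exactly three, the pattern above can be built with sprintf. The following is only a sketch: extract_context is a hypothetical helper name (not part of the answer), it assumes term contains no regex metacharacters, and it swaps the literal space before the term for \\W+ so that a double space directly before the term is also tolerated.

```r
# Hypothetical helper (an assumption, not from the answer):
# keep n words on each side of `term`.
# Assumes `term` contains no regex metacharacters.
extract_context <- function(x, term, n = 3) {
  # Same shape as the pattern above, with the counts and the term filled in
  pattern <- sprintf(".*((\\W+\\w+){%d}\\W+%s(\\W+\\w+){%d}).*", n, term, n)
  # Like the gsub call above, this returns x unchanged where nothing matches
  sub(pattern, "\\1", x, perl = TRUE)
}

a <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
       "The day was nice and dry, when she came for our game, but we were not ready. She left after she waited 5 minutes.",
       "The day was nice and dry, when she came, we were not here. Our game was not completed timely, but it was completed after one hour.")

extract_context(a, "game", n = 2)
# [1] " for our game we were"
# [2] " for our game, but we"
# [3] " here. Our game was not"
```

With n = 3 this should reproduce the desired output from the question.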
Answer 1 (score: 0)
Here is another approach, using str_extract from the stringr package:
library(stringr)
str_extract(a, "(( \\S+){3} game[[:punct:]\\s]*( \\S+){3})")
# [1] " came for our game we were ready"
# [2] " came for our game, but we were"
# [3] " not here. Our game was not completed"
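One practical difference from the gsub approach: str_extract returns only the matched text and gives NA when nothing matches, whereas gsub returns the whole input unchanged. If stringr is not available, similar behaviour can be sketched in base R with regexpr and regmatches. The pattern below is an assumption for illustration (it uses \\W+ and \\w+ rather than the pattern in this answer), and the second input string is made up to show the no-match case.

```r
# First string from the question, plus a made-up string without the term
x <- c("The day was nice and dry, when she came for our game we were ready and then she left.",
       "Nothing relevant here.")

# Three words, the term, three words; any non-word run may separate them
pattern <- "(\\W+\\w+){3}\\W+game(\\W+\\w+){3}"

m <- regexpr(pattern, x, perl = TRUE)   # -1 where there is no match
out <- rep(NA_character_, length(x))    # default to NA, like str_extract
out[m != -1] <- regmatches(x, m)        # fill in only the matched elements
out
# [1] " came for our game we were ready" NA
```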