I have a vector of text strings, for example:
Sentences <- c("I would have gotten the promotion, but TEST my attendance wasn’t good enough.Let me help you with your baggage.",
"Everyone was busy, so I went to the movie alone. Two seats were vacant.",
"TEST Rock music approaches at high velocity.",
"I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
"A purple pig and a green donkey TEST flew a TEST kite in the middle of the night and ended up sunburnt.",
"Rock music approaches at high velocity TEST.")
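I want to extract n (e.g., three) words AROUND (i.e., before and after) a specific term (e.g., 'TEST'), where a word is defined as a run of characters with whitespace before and after it. Important: multiple matches should be allowed (i.e., if the term occurs more than once in a string, the solution should capture every occurrence).
The result might look like this (the format can be improved):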
S1 <- c(before = "the promotion, but", after = "my attendance wasn’t")
S2 <- c(before = "", after = "")
S3 <- c(before = "", after = "Rock music approaches")
S4a <- c(before = "to take your", after = "donation; any amount")
S4b <- c(before = "will be greatly", after = "appreciated.")
S5a <- c(before = "a green donkey", after = "flew a TEST")
S5b <- c(before = "TEST flew a", after = "kite in the")
S6 <- c(before = "at high velocity", after = "")
How can I do this? I have found other posts, but they either handle only single-match cases or depend on fixed sentence structures.
Answer 0 (score: 0)
The quanteda package has a great function for this: kwic() (keywords in context).
Out of the box, it works very well on your example:
library("quanteda")
names(Sentences) <- paste0("S", seq_along(Sentences))
# tokenize first: recent quanteda releases expect a tokens object in kwic()
(kw <- kwic(tokens(Sentences), "TEST", window = 3))
#
# [S1, 9]    promotion, but | TEST | my attendance wasn't
# [S3, 1]                   | TEST | Rock music approaches
# [S4, 7]      to take your | TEST | donation; any
# [S4, 15]  will be greatly | TEST | appreciated.
# [S5, 8]    a green donkey | TEST | flew a TEST
# [S5, 11]      TEST flew a | TEST | kite in the
# [S6, 7]  at high velocity | TEST | .
(kw2 <- as.data.frame(kw)[, c("docname", "pre", "post")])
#   docname              pre                  post
# 1      S1  promotion , but  my attendance wasn't
# 2      S3                  Rock music approaches
# 3      S4     to take your        donation ; any
# 4      S4  will be greatly         appreciated .
# 5      S5   a green donkey           flew a TEST
# 6      S5      TEST flew a           kite in the
# 7      S6 at high velocity                     .
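This is probably a better format than the separate objects you asked for in the question. But to get as close as possible to your target, you can transform it further as follows.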
# this picks up the empty matching sentence S2
(kw3 <- merge(kw2,
              data.frame(docname = names(Sentences), stringsAsFactors = FALSE),
              all.y = TRUE))
# replaces the NA with the empty string
kw4 <- as.data.frame(lapply(kw3, function(x) { x[is.na(x)] <- ""; x }),
                     stringsAsFactors = FALSE)
# renames pre/post to before/after
names(kw4)[2:3] <- c("before", "after")
# makes the docname unique
kw4$docname <- make.unique(kw4$docname)
kw4
#   docname           before                 after
# 1      S1  promotion , but  my attendance wasn't
# 2      S2
# 3      S3                  Rock music approaches
# 4      S4     to take your        donation ; any
# 5    S4.1  will be greatly         appreciated .
# 6      S5   a green donkey           flew a TEST
# 7    S5.1      TEST flew a           kite in the
# 8      S6 at high velocity                     .
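Note that kwic() treats punctuation marks as separate tokens, so "donation ; any" counts as three tokens. If you want words defined strictly by surrounding whitespace, as in the question, a rough base R sketch could look like the one below. words_around() is a hypothetical helper name of my own, not a quanteda function. It assumes exact, case-sensitive matching of the term, and it ignores punctuation at word edges only for the comparison, so that "TEST." still counts as a match.

```r
# Hypothetical helper (not part of quanteda): for every occurrence of `term`
# in the string `x`, return the n whitespace-delimited words before and after it.
words_around <- function(x, term, n = 3) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  # Strip punctuation at word edges for the comparison only
  bare <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  hits <- which(bare == term)
  idx <- seq_along(words)
  lapply(hits, function(i) {
    before <- words[idx >= i - n & idx < i]
    after  <- words[idx > i & idx <= i + n]
    c(before = paste(before, collapse = " "),
      after  = paste(after, collapse = " "))
  })
}

# Three of the sentences from the question, named for readability
sentences <- c(
  S2 = "Everyone was busy, so I went to the movie alone. Two seats were vacant.",
  S4 = "I am happy to take your TEST donation; any amount will be greatly TEST appreciated.",
  S6 = "Rock music approaches at high velocity TEST."
)

# One list element per sentence, each holding one before/after pair per match
res <- lapply(sentences, words_around, term = "TEST", n = 3)

res$S4[[1]]
# named character vector: before = "to take your", after = "donation; any amount"
```

Sentences without a match (S2 here) come back as an empty list, and a match at the very start or end of a string yields "" for the missing side.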