This question is a follow-up to an earlier one: Find the names contained in each sentence (not the other way around)
I'll reproduce the relevant part here. Starting from this:
> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
we obtained this result:
library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"
#[[2]]
#[1] "Melanchthon" "Martin Luther"
#[[3]]
#[1] "Paul"
#[[4]]
#[1] NA
#[[5]]
#[1] "Melanchthon"
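But for a large toMatch vector, joining its values with the OR operator may not be efficient. So my question is: how can the same result be obtained with a function or a loop? Perhaps that way a regular expression such as \< or \b could be applied to each toMatch value, so that only whole words are matched rather than arbitrary substrings.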
I have tried the following, but I don't know how to save the matches in lst so that I get the same result as above.
for(i in 1:length(sentences)){
for(j in 1:length(toMatch)){
lst<-str_extract_all(sentences[i], toMatch[j])
}}
Answer 0 (score: 1)
Is this what you are looking for?
library(stringr)
sentences <- c(
"Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
" Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
" He studied the Scripture, especially of Paul, and Evangelical doctrine",
" He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",
" Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")
toMatch <- c("Martin Luther", "Paul", "Melanchthon")
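# Pre-allocate one list element per sentence so lst exists before the loop
lst <- vector("list", length(sentences))
for (i in seq_along(sentences)) {
  # One slot per name; NA marks "no match yet"
  lst[[i]] <- rep(NA_character_, length(toMatch))
  for (j in seq_along(toMatch)) {
    tmp <- str_extract_all(sentences[i], toMatch[j])
    if (length(tmp[[1]]) > 0) {
      # Keep only the first hit; repeated hits of the same name are identical
      lst[[i]][j] <- tmp[[1]][1]
    }
  }
}
# Drop the unmatched slots and store the result back in lst
lst <- lapply(lst, function(x) x[!is.na(x)])
lst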
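The question also asks about matching whole words only. A minimal sketch of that idea follows. It assumes the names in toMatch contain no regex metacharacters, and it uses shortened sentences plus one made-up sentence ("Pauline letters") to show the effect of the word boundaries. extract_names() is a hypothetical helper, not part of stringr.

```r
library(stringr)

# Shortened versions of the question's sentences; the last one is made up
# to show that "Paul" no longer matches inside "Pauline".
sentences <- c(
  "He accepted a call to Wittenberg by Martin Luther",
  " Melanchthon became professor with the help of Martin Luther",
  " He studied the Scripture, especially of Paul",
  " He was present at the disputation of Leipzig (1519)",
  " Pauline letters were discussed"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# Wrap each name in \b so that only whole words match.
# Assumption: the names contain no regex metacharacters.
patterns <- paste0("\\b", toMatch, "\\b")

# Hypothetical helper: collect the matches of every pattern in one sentence,
# returning NA when nothing matches.
extract_names <- function(sentence, patterns) {
  hits <- unlist(lapply(patterns, function(p) str_extract_all(sentence, p)[[1]]))
  if (length(hits) == 0) NA_character_ else hits
}

lst <- lapply(sentences, extract_names, patterns = patterns)
lst
```

Note that matches come back in the order of toMatch rather than in their order of appearance in the sentence, so the second sentence yields "Martin Luther" before "Melanchthon". Whether looping over patterns is actually faster than a single OR-joined pattern is not established here and would need to be measured on the real data.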