Find the names contained in each sentence using a large vector of names

Time: 2017-07-05 08:54:09

Tags: r

This question is an extension of this one: Find the names contained in each sentence (not the other way around)

I will reproduce the relevant part here. Given:

> sentences
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"
[4] " He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments."                                                                          
[5] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

we obtained this result:

library(stringr)
lst <- str_extract_all(sentences, paste(toMatch, collapse="|"))
lst[lengths(lst)==0] <- NA
lst
#[[1]]
#[1] "Martin Luther"

#[[2]]
#[1] "Melanchthon"   "Martin Luther"

#[[3]]
#[1] "Paul"

#[[4]]
#[1] NA

#[[5]]
#[1] "Melanchthon"

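But for a large toMatch vector, joining its values with the OR operator may not be efficient. So my question is: how can I get the same result using a function or a loop? That way it might also be possible to wrap the toMatch values in regular expressions such as \< or \b, so that only whole words are matched rather than arbitrary substrings.

I have tried the following, but I don't know how to save the matches in lst so as to get the same result as above.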

for(i in 1:length(sentences)){
    for(j in 1:length(toMatch)){
        lst<-str_extract_all(sentences[i], toMatch[j])
        }}

1 answer:

Answer 0 (score: 1)

Are you expecting something like this?

library(stringr)

sentences <- c(
"Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin",
" Melanchthon became professor of the Greek language in Wittenberg at the age of 21 with the help of Martin Luther",
" He studied the Scripture, especially of Paul, and Evangelical doctrine",
" He was present at the disputation of Leipzig (1519) as a spectator, but participated by his comments.",                                          
" Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium")

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

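# lst must exist before it is indexed with [[i]]: one slot per sentence
lst <- vector("list", length(sentences))

for (i in seq_along(sentences)) {
  # one placeholder per name, filled in only when that name is found
  lst[[i]] <- rep(NA_character_, length(toMatch))
  for (j in seq_along(toMatch)) {
    tmp <- str_extract_all(sentences[i], toMatch[j])
    if (length(tmp[[1]]) > 0) {
      # keep only the first hit so a repeated name does not trigger a length warning
      lst[[i]][j] <- tmp[[1]][1]
    }
  }
}
# drop the NA placeholders and store the result (the original discarded it)
lst <- lapply(lst, function(x) x[!is.na(x)])
# sentences with no match become NA, as in the question's expected output
lst[lengths(lst) == 0] <- NA
lst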
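
The question also asks about matching whole words only. The following is a minimal base R sketch of one way to do that, not the answerer's method. It assumes the names are plain ASCII and that the sentences can be shortened for the example (they are abbreviated from the question below). find_names() is a hypothetical helper name. Each name is wrapped in \b word boundaries, and \Q...\E (PCRE syntax, hence perl = TRUE) makes any regex metacharacters inside a name match literally.

```r
# Sentences shortened from the question; toMatch is unchanged.
sentences <- c(
  "he accepted a call to the University of Wittenberg by Martin Luther",
  " Melanchthon became professor of the Greek language with the help of Martin Luther",
  " He studied the Scripture, especially of Paul, and Evangelical doctrine",
  " He was present at the disputation of Leipzig (1519) as a spectator",
  " Johann Eck having attacked his views, Melanchthon replied"
)
toMatch <- c("Martin Luther", "Paul", "Melanchthon")

# find_names() is a hypothetical helper, not part of any package.
find_names <- function(sentences, names_vec) {
  # \Q...\E treats the name literally; \b requires a word boundary on each side.
  patterns <- paste0("\\b\\Q", names_vec, "\\E\\b")
  lapply(sentences, function(s) {
    # One TRUE/FALSE per name: does it occur in this sentence as a whole word?
    hit <- vapply(patterns, function(p) grepl(p, s, perl = TRUE),
                  logical(1), USE.NAMES = FALSE)
    # Return the matching names, or NA when none is found.
    if (any(hit)) names_vec[hit] else NA_character_
  })
}

res <- find_names(sentences, toMatch)
res
```

Unlike str_extract_all with a joined pattern, this returns each name at most once per sentence and in the order of toMatch rather than in order of appearance. With the word boundaries, a name such as "Paul" would not be reported for a sentence that only contains "Pauline".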