如何使用R

时间:2015-07-21 09:24:40

标签: regex r tm opennlp

我使用R从文本中提取包含特定人名的句子,这是一个示例段落:

  

作为蒂宾根的改革者,他接受了由他的叔叔Johann Reuchlin推荐的Martin Luther对维滕贝格大学的电话。 Melanchthon在21岁时成为维滕贝格的希腊语教授。他研究了圣经,尤其是保罗和福音派教义。他作为旁观者出席了莱比锡(1519)的辩论,但参与了他的评论。约翰·埃克(Johann Eck)攻击了他的观点,Melanchthon在他的Defensio对手Johannem Eckium的基础上回答了圣经的权威。

在这个短段中,有几个人名,例如: Johann Reuchlin Melanchthon Johann Eck 。在 openNLP 包的帮助下,可以正确提取和识别三个人名 Martin Luther Paul Melanchthon 。然后我有两个问题:

  1. 我如何提取包含这些名称的句子
  2. 由于命名实体识别器的输出不是那么有希望,如果我为每个名称添加“[[]]”,如[[Johann Reuchlin]],[[Melanchthon]],我怎样才能提取句子包含这些名称表达式 [[A]],[[B]] ......?

2 个答案:

答案 0 :(得分:7)

Using `strsplit` and `grep`, first I set made an object `para` which was your paragraph.

toMatch <- c("Martin Luther", "Paul", "Melanchthon")

unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]


> unlist(strsplit(para,split="\\."))[grep(paste(toMatch, collapse="|"),unlist(strsplit(para,split="\\.")))]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                                    
[3] " He studied the Scripture, especially of Paul, and Evangelical doctrine"                                                                               
[4] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"    

或者更清洁一点:

sentences<-unlist(strsplit(para,split="\\."))
sentences[grep(paste(toMatch, collapse="|"),sentences)]

如果您正在寻找每个人所处的句子作为单独的回报,那么:

toMatch <- c("Martin Luther", "Paul", "Melanchthon")
sentences<-unlist(strsplit(para,split="\\."))
foo<-function(Match){sentences[grep(Match,sentences)]}
lapply(toMatch,foo)

[[1]]
[1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

编辑3:要添加每个人的姓名,请执行以下操作:

foo<-function(Match){c(Match,sentences[grep(Match,sentences)])}

编辑4:

如果你想找到有多个人/地方/事物(单词)的句子,那么只需为这两个人添加一个参数,例如:

toMatch <- c("Martin Luther", "Paul", "Melanchthon","(?=.*Melanchthon)(?=.*Scripture)")

并将perl更改为TRUE

foo<-function(Match){c(Match,sentences[grep(Match,sentences,perl = T)])}


> lapply(toMatch,foo)
[[1]]
[1] "Martin Luther"                                                                                                                                         
[2] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin"

[[2]]
[1] "Paul"                                                                   
[2] " He studied the Scripture, especially of Paul, and Evangelical doctrine"

[[3]]
[1] "Melanchthon"                                                                                                                          
[2] " Melanchthon became professor of the Greek language in Wittenberg at the age of 21"                                                   
[3] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

[[4]]
[1] "(?=.*Melanchthon)(?=.*Scripture)"                                                                                                     
[2] " Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium"

编辑5:回答你的其他问题:

假设:

sentenceR<-"Opposed as a reformer at [[Tübingen]], he accepted a call to the University of [[Wittenberg]] by [[Martin Luther]], recommended by his great-uncle [[Johann Reuchlin]]"

gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])

会在双括号内给出单词。

> gsub("\\[\\[|\\]\\]", "", regmatches(sentenceR, gregexpr("\\[\\[.*?\\]\\]", sentenceR))[[1]])
[1] "Tübingen"        "Wittenberg"      "Martin Luther"   "Johann Reuchlin"

答案 1 :(得分:3)

这是一个使用两个包 quanteda stringi 的相当简单的方法:

sents <- unlist(quanteda::tokenize(txt, what = "sentence"))
namesToExtract <- c("Martin Luther", "Paul", "Melanchthon")
namesFound <- unlist(stringi::stri_extract_all_regex(sents, paste(namesToExtract, collapse = "|")))
sentList <- split(sents, list(namesFound))

sentList[["Melanchthon"]]
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."

sentList
## $`Martin Luther`
## [1] "Opposed as a reformer at Tübingen, he accepted a call to the University of Wittenberg by Martin Luther, recommended by his great-uncle Johann Reuchlin."
## 
## $Melanchthon
## [1] "Melanchthon became professor of the Greek language in Wittenberg at the age of 21."                                                   
## [2] "Johann Eck having attacked his views, Melanchthon replied based on the authority of Scripture in his Defensio contra Johannem Eckium."
## 
## $Paul
## [1] "He studied the Scripture, especially of Paul, and Evangelical doctrine."