How to find two words in any order within a period-delimited sentence

Time: 2018-12-08 18:05:22

Tags: r

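I am trying to extract any sentence (defined as the text between two periods) that contains the two words column and Barr, in either order. This is tricky: the regular expression I have so far finds the two words in any order before a period, but when the two words occur in different sentences it selects all the text between those sentences. How can I make the regular expression sentence-specific?

Input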

try<-c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.")

Desired output

[1] NA
[2] "I am a sentence and I contain column but also Barr."

Attempt

str_extract_all(try, "\\..*column.*Barr.*?\\.|.*Barr.*column.*?\\.")

Current output

[[1]]
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."

[[2]]
[1] ". I am a sentence and I contain column but also Barr. I only contain Barr."

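3 answers:

Answer 0: (score: 3)

To find two words occurring in any order, you can use two positive lookaheads. For example, grepl("(?=.*Barr)(?=.*column)", x, perl=TRUE) returns TRUE whenever both words occur, regardless of their order, and FALSE otherwise, but it does not take sentence structure into account. Since you want to extract text and require both words to sit between the same pair of periods, we can change it to: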

library(stringr)
## Example data
x <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.","Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.","Barr and column and also column. But just Barr. And just column. Now again column and Barr")
> x
[1] "I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well."
[2] "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too."       
[3] "Barr and column and also column. But just Barr. And just column. Now again column and Barr"           

str_extract_all(x,"(\\.|^)(?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*(\\.|$)")

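This looks for the start of the string or a period (\\.|^), followed by non-period characters that contain both Barr and column (?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*, followed by a period or the end of the string (\\.|$). It returns a list with one character vector of matches per input string.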
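As a rough illustration of the difference between the whole-string lookahead check and the sentence-scoped pattern, here is a minimal base R sketch. It assumes the example vector x from this answer; the clean() helper is hypothetical and only strips the delimiting periods and surrounding spaces from each match.

```r
# Example data copied from the answer above.
x <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
       "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
       "Barr and column and also column. But just Barr. And just column. Now again column and Barr")

# Whole-string check: TRUE when both words occur anywhere, in any order.
both_anywhere <- grepl("(?=.*Barr)(?=.*column)", x, perl = TRUE)

# Sentence-scoped pattern from the answer, run through base R's PCRE engine.
pattern <- "(\\.|^)(?=[^\\.]*Barr)(?=[^\\.]*column)[^\\.]*(\\.|$)"
raw_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Hypothetical helper: drop the delimiting periods and surrounding whitespace.
clean <- function(m) trimws(gsub("^\\.|\\.$", "", m))
sentences <- lapply(raw_matches, clean)
```

Here both_anywhere is TRUE for all three strings, while sentences keeps only the sentences that contain both words. One caveat, as far as I can tell: because the pattern consumes the closing period, a qualifying sentence that directly follows another match is not picked up.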

Answer 1: (score: 1)

This regular expression seems to do what you want:

(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)

It starts at a period (.) and captures everything that is not a period but contains column followed by Barr, or the same with the two words in the opposite order.

Example:

try = c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
        "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
        "I am a sentence and I contain column but also Barr. I only contain Barr. I am too.",
        "I contain column and Barr. I have Barr and column. I don't.",
        "Hello. I contain Barr and column but also Barr. I only contain Barr. I am too.") 

k = sapply(try, function(x){
  str_extract(paste0(".",x), "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)")
})
names(k) = NULL

Result:

[1] NA
[2] ". I am a sentence and I contain column but also Barr"
[3] ".I am a sentence and I contain column but also Barr"
[4] ".I contain column and Barr"
[5] ". I contain Barr and column but also Barr"

If you use str_extract_all, keep in mind that it returns a list of matches.

[[1]]
character(0)

[[2]]
[1] ". I am a sentence and I contain column but also Barr"

[[3]]
[1] ".I am a sentence and I contain column but also Barr"

[[4]]
[1] ".I contain column and Barr" ". I have Barr and column"

[[5]]
[1] ". I contain Barr and column but also Barr"

I added paste0(".", x) so that sentences that contain both words and come first in the string (and therefore do not start with a period) are also detected.
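To see why the leading period is prepended, here is a small base R sketch that runs the same alternation with and without the prefix. The input string first_sentence and the helper extract_all() are made up for illustration and are not part of the original answer.

```r
# Alternation pattern from the answer: each match must start at a period.
pattern <- "(\\.[^.]*column[^.]*Barr[^.]*)|(\\.[^.]*Barr[^.]*column[^.]*)"

# Hypothetical input whose first sentence contains both words.
first_sentence <- "I contain column and Barr. I don't."

# Hypothetical helper: all matches of the pattern in a single string.
extract_all <- function(s) regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]

without_prefix <- extract_all(first_sentence)              # first sentence has no leading "."
with_prefix    <- extract_all(paste0(".", first_sentence)) # prefix supplies the missing "."

# Optional clean-up: strip the leading period and any whitespace after it.
cleaned <- sub("^\\.[[:space:]]*", "", with_prefix)
```

Without the prefix nothing is matched, because the only qualifying sentence does not begin with a period; with the prefix the first sentence is found, and cleaned holds it without the leading period.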

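Answer 2: (score: 1)

Here is a more general attempt that does not require building every permutation of the desired words, which is useful when more than two words are required.

The strategy is to find the sentences that contain each word and then take the intersection of the results.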

#split the long text into individual sentences.
sentences<-strsplit(try, "\\.")

#create list of matches for each desired word
columnlist<-lapply(sentences, function(x) {grep("(column)", x)})
barrlist<-lapply(sentences, function(x) {grep("(Barr)", x)})

#find intersection between lists
intersection<-lapply(seq_along(columnlist), function(i){intersect(columnlist[[i]], barrlist[[i]])} )

#extract the sentences out
answer<-sapply(seq_along(intersection), function(i) { 
  if(length(intersection[[i]])) 
    {trimws(sentences[[i]][intersection[[i]] ])}  
  else {NA}
})

Result

#[[1]]
#[1] NA
#
#[[2]]
#[1] "I am a sentence and I contain column but also Barr"
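
The same split-and-intersect idea can be wrapped in a function that accepts any number of words. The following is only a sketch: the name sentences_with_all and the vector txt are hypothetical, and it uses fixed substring matching, so "column" would also match "columns".

```r
# Hypothetical helper: for each string, return the sentences containing every word.
sentences_with_all <- function(text, words) {
  lapply(strsplit(text, ".", fixed = TRUE), function(s) {
    s <- trimws(s)
    # One logical vector per word, combined element-wise with AND.
    keep <- Reduce(`&`, lapply(words, function(w) grepl(w, s, fixed = TRUE)))
    if (any(keep)) s[keep] else NA_character_
  })
}

# Example data copied from the question.
txt <- c("I am a sentence.I am a sentence and I contain Barr. I contain other things. I contain column as well.",
         "Here we go. I am a sentence and I contain column but also Barr. I only contain Barr. I am too.")

two_words   <- sentences_with_all(txt, c("column", "Barr"))
three_words <- sentences_with_all(txt, c("column", "Barr", "sentence"))
```

Adding a third word only means extending the words vector; no extra regular expression alternatives are needed.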