Question

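Hey, I want to extract names from text. My identifying pattern is that a name always starts with a capital letter and consists of two or three consecutive capitalized words. I also allow for an author called "Jack Jr. Bones", so I made the "." optional. One last case: the text may contain something like "the Robert Brown theater", so I want to exclude every case where the two or three capitalized words are preceded by "the". I do this with a negative lookbehind: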

library(stringr)
test <- "A beautiful day for Jack Bones ended in the Robert Brown theater"
str_extract(test, "(?<!the\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "Jack Bones"

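But now I face the following problem: if a sentence starts with "The Robert Brown theater", this pattern matches it as well. I thought I was being clever by simply adding "(?i)" inside the negative lookbehind, but that turned out not to work: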

test <- "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there"
str_extract(test, "(?<!(?i)the\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"

Another idea was to simply add an alternation (an "or" condition):

str_extract(test, "(?<!(the\\s|The\\s))(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"

Then I tested whether it would work with only "The" in the negative lookbehind, and found that even this fails:

str_extract(test, "(?<!The\\s)(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "The Robert Brown"

Now I'm a bit stumped. I don't understand why the negative lookbehind works for "the" but not when I condition on "The". Any help would be appreciated!

Answer 1

This is a variant of the greatest regex trick ever:

 match_this | or_this | (but_really_keep_this)

As for R, you can use the often-overlooked base regex functions with perl = TRUE:

test <- c("A beautiful day for Jack Bones ended in the Robert Brown theater",
          "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there")

pattern <- "(?:[Tt]he\\s+(?:[A-Z][\\w.]*\\s*){2,3})(*SKIP)(*FAIL)|(?:[A-Z][\\w.]*\\s*){2,3}"

m <- gregexpr(pattern, test, perl = TRUE)
lapply(regmatches(test, m), trimws)

Which yields

[[1]]
[1] "Jack Bones"

[[2]]
[1] "Jack Bones"

As you can see, the pattern used is basically this:

The/the Word1 Word2 Word3 | (Word1 Word2 Word3)

You could even shorten the code to a rather unreadable one-liner (not recommended, though):

lapply(regmatches(test, gregexpr(pattern, test, perl = TRUE)), trimws)
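
To check that the same pattern also copes with the "Jack Jr. Bones" case from the question, here is a minimal sketch that wraps the calls above in a helper. The function name extract_names and the third test sentence are assumptions added for illustration; they are not part of the original answer.

```r
# Hypothetical helper (the name is an assumption): wraps the
# gregexpr/regmatches calls so they can be reused on any character vector.
extract_names <- function(x) {
  # Left branch: "the"/"The" plus 2-3 capitalized words, thrown away
  # via (*SKIP)(*FAIL). Right branch: 2-3 capitalized words that are kept.
  pattern <- "(?:[Tt]he\\s+(?:[A-Z][\\w.]*\\s*){2,3})(*SKIP)(*FAIL)|(?:[A-Z][\\w.]*\\s*){2,3}"
  m <- gregexpr(pattern, x, perl = TRUE)
  # A match can carry trailing whitespace from \\s*, hence trimws().
  lapply(regmatches(x, m), trimws)
}

test <- c("A beautiful day for Jack Bones ended in the Robert Brown theater",
          "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there",
          "We met Jack Jr. Bones at the Robert Brown theater")  # made-up sentence

res <- extract_names(test)
print(res)
```

Note that [\\w.]* and \\s* may both match nothing, so an all-caps token such as an acronym could be split into several "words"; for the name-like input here that does not come up.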

Answer 2

I think what you need is a negative lookahead. You can see it here

(?!(the\\s|The\\s))(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))

Your own regex almost did the trick.

For more on this, you can check this link
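
A quick way to check this suggestion is to run it on both sentences from the question. The sketch below uses base R's regexpr and regmatches with perl = TRUE instead of str_extract (my substitution, so the example runs without stringr); the variable names are illustrative.

```r
test <- c("A beautiful day for Jack Bones ended in the Robert Brown theater",
          "The Robert Brown theater was nice, but Jack Bones did not enjoy his time there")

# Lookahead-only pattern as posted in this answer.
lookahead_only <- "(?!(the\\s|The\\s))(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))"

# First match per sentence, analogous to str_extract().
first_match <- regmatches(test, regexpr(lookahead_only, test, perl = TRUE))
print(first_match)
```

A lookahead only constrains where a match may start. It stops a match from beginning at "The", but the engine can still begin one word later, so for the second sentence this returns "Robert Brown" rather than "Jack Bones". The lookahead therefore seems to need the lookbehind alongside it rather than replacing it.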

Answer 3

In the end, I also figured out why my own code failed on this example.

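My pattern starts matching at "The Robert" and only then checks whether that match is preceded by "the" or "The", which of course it is not, because nothing comes before the start of the sentence. So I need an additional lookahead for "The":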

test <- "The Robert Brown theater was nice, but Jack Bones and Hover Edgar did not enjoy his time there"
str_extract(test, "(?<![Tt]he\\s)((?!The))(([A-Z][\\w]+\\s[A-Z][\\w]+[[:punct:]]?\\s[A-Z][\\w]+)|([A-Z][\\w]+\\s[A-Z][\\w]+))")
[1] "Jack Bones"

Tidying up the code gives:

str_extract(test, "(?<![Tt]he\\s)((?!The))[A-Z][\\w]+\\s[A-Z][\\w]+([[:punct:]]?\\s[A-Z][\\w]+)?")
[1] "Jack Bones"

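Another advantage of this solution is that I can stay within the str_extract framework and do not have to switch to a different R function that allows Perl syntax.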
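
The diagnosis above can be checked by looking at where the match starts. This is a minimal sketch in base R (regexpr and gregexpr with perl = TRUE, used only so the example runs without stringr; with stringr, str_extract_all would be the counterpart for collecting every match). The patterns are a simplified variant of mine and the variable names are illustrative.

```r
test <- "The Robert Brown theater was nice, but Jack Bones and Hover Edgar did not enjoy his time there"

# Lookbehind only: the match begins at "The" itself, so there is nothing
# in front of it that the lookbehind could reject.
lookbehind_only <- "(?<![Tt]he\\s)[A-Z]\\w+\\s[A-Z]\\w+(?:[[:punct:]]?\\s[A-Z]\\w+)?"
first_hit <- regexpr(lookbehind_only, test, perl = TRUE)
print(as.integer(first_hit))         # start position of the first match
print(regmatches(test, first_hit))   # the text that was matched

# Lookbehind plus lookahead: the lookahead blocks matches that begin with
# "The ", the lookbehind blocks matches that directly follow "the "/"The ".
combined <- "(?<![Tt]he\\s)(?!The\\s)[A-Z]\\w+\\s[A-Z]\\w+(?:[[:punct:]]?\\s[A-Z]\\w+)?"
all_names <- regmatches(test, gregexpr(combined, test, perl = TRUE))[[1]]
print(all_names)
```

The first pattern matches at position 1 and returns "The Robert Brown"; the combined pattern skips it and returns "Jack Bones" and "Hover Edgar".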

Regex negative lookbehind in R