Question

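I am trying to extract possible author names from an article. My working assumption is that the author name appears in a line of the form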

"By FirstName LastName"

or

"By FirstName MiddleName LastName"

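and that the first, middle, and last names each start with a capital letter.

How can I use a regular expression to extract every run of 2-3 strings after "By" that also meets the criteria above?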

For example, if the article contains the text

"By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"

it would extract

"Barack Obama"

and

"January"

as possible author names, and I would then do the work of deciding which one is correct.

My current regular expression is:

/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/

However, when I use it on the string

"By Alex Jackson Olerud"

it seems to return both

"Alex Jackson Olerud"

and

" Olerud"

I use Ruby as my language of choice, but any language-agnostic solution would do.

Answer 1

Here is my suggestion:

str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president. 
By A. B. Cecil"

def find_authors(str)
    str.scan(/
    (?<name> # a named capture group for one of the names
            \p{Lu} # starts with an upper case letter, unicode so will work also for e.g. Åsa
            (?: \. | \p{Ll}+) # followed by a period or some lower case letters
    ){0} # zero matches, this is just a subroutine to be used again

    (?<=[Bb]y\s) # lookbehind to make sure the author is after a by or By
    (?<wholename> # capture group to extract the whole name
        \g<name> (\s \g<name>){1,2} # a name should have two or three components
    )/x).map(&:last) # remove the match by the <name> group from the result
end

def find_authors_oneline(str)
    str.scan(/(?<name>\p{Lu}(?:\.|\p{Ll}+)){0}(?<=[Bb]y\s)(?<wholename>\g<name>(\s\g<name>){1,2})/).map(&:last)
end

p find_authors str
# => ["Barack Obama", "A. B. Cecil"]
p find_authors_oneline str
# => ["Barack Obama", "A. B. Cecil"]

You can read up on regex subroutines and the regex /x modifier.
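
To show the subroutine idea on its own, here is a smaller sketch. The constant name PAIR and the sample string are made up for illustration and are not part of the solution above. A named group followed by {0} matches nothing at the place where it is written; it only defines a pattern that \g<word> can call later.

```ruby
# Illustrative only: PAIR and the sample text are hypothetical names.
PAIR = /
  (?<word> \p{Lu}\p{Ll}+ ){0}   # definition only, never matched here
  \g<word> \s \g<word>          # two capitalized words separated by one whitespace character
/x

# String#[] with a regexp returns the first matched substring.
p "article by Barack Obama today"[PAIR]  # => "Barack Obama"
```

With the /x modifier, whitespace and # comments inside the pattern are ignored, which is why the pattern can be spread over several lines.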

Answer 2

I think the second capture group, (\s+[A-Z][\w-]*), is what is tripping you up. Try a non-capturing group instead, e.g. (?:\s+[A-Z][\w-]*)
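
A quick sketch of the difference, using the string from the question. The constant names WITH_CAPTURE and WITHOUT_CAPTURE are only for illustration. When a pattern contains capture groups, String#scan returns one array per match with one entry per group, and a repeated group keeps only its last iteration, which is where " Olerud" comes from.

```ruby
# Pattern from the question: the inner (...) is a second capture group.
WITH_CAPTURE = /By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
# Same pattern with the repeated part made non-capturing.
WITHOUT_CAPTURE = /By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)/

str = "By Alex Jackson Olerud"

p str.scan(WITH_CAPTURE)            # => [["Alex Jackson Olerud", " Olerud"]]
p str.scan(WITHOUT_CAPTURE).flatten # => ["Alex Jackson Olerud"]
```

Note that this only changes what is captured. The pattern itself still requires at least two capitalized words and puts no upper limit on how many it accepts.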

Answer 3

str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"

str.scan(/(?:By )((?:[A-Z][A-Za-z]+ ?+)+)/).flatten.map(&:strip)
#=> ["Barack Obama", "January"]
