I am trying to extract possible author names from an article. My working assumption is that the author name appears in a line such as
"By FirstName LastName"
or
"By FirstName MiddleName LastName"
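and that the first, middle, and last names all start with a capital letter.
How can I use a regular expression to extract every run of 2-3 words following "By" that also meets the conditions above?
For example, if the article contains the text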
"By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
it would extract
"Barack Obama"
and
"January"
as possible author names, and I will then do the work of deciding which one is correct.
My current regular expression is:
/By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
However, when I use it on the string
"By Alex Jackson Olerud"
it seems to return both
"Alex Jackson Olerud"
and
" Olerud"
Ruby is my language of choice, but any language-agnostic solution will do.
Answer 0 (score: 3)
Here is my suggestion:
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president.
By A. B. Cecil"
def find_authors(str)
str.scan(/
(?<name> # a named capture group for one of the names
\p{Lu} # starts with an upper case letter, unicode so will work also for e.g. Åsa
(?: \. | \p{Ll}+) # followed by a period or some lower case letters
){0} # zero matches, this is just a subroutine to be used again
(?<=[Bb]y\s) # lookbehind to make sure the author is after a by or By
(?<wholename> # capture group to extract the whole name
\g<name> (\s \g<name>){1,2} # a name should have at least two components
)/x).map(&:last) # remove the match by the <name> group from the result
end
def find_authors_oneline(str)
str.scan(/(?<name>\p{Lu}(?:\.|\p{Ll}+)){0}(?<=[Bb]y\s)(?<wholename>\g<name>(\s\g<name>){1,2})/).map(&:last)
end
p find_authors str
>> ["Barack Obama", "A. B. Cecil"]
p find_authors_oneline str
>> ["Barack Obama", "A. B. Cecil"]
Answer 1 (score: 2)
I think the second capture group, (\s+[A-Z][\w-]*),
is what is tripping you up. Try a non-capturing group instead, e.g. (?:\s+[A-Z][\w-]*)
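To make the difference concrete, here is a minimal sketch that assumes the pattern is used with String#scan, as in the other answers. The variable names are made up for illustration. When a pattern contains capture groups, scan returns one array per match holding every group, and a repeated group keeps only its last iteration, which would explain the stray " Olerud".

```ruby
# Sample input from the question.
text = "By Alex Jackson Olerud"

# Original pattern: the outer group and the repeated inner group both capture.
# The inner group is repeated, so it keeps only its last iteration.
capturing = /By ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)/
with_inner_capture = text.scan(capturing)

# Same pattern with the repeated group made non-capturing via (?: ... ).
non_capturing = /By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)/
names = text.scan(non_capturing).flatten

p with_inner_capture  # [["Alex Jackson Olerud", " Olerud"]]
p names               # ["Alex Jackson Olerud"]

# Assumption: a single capitalized word such as "January" should also count
# as a candidate. Changing the trailing + to * makes the extra words optional.
sample = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
optional_rest = /By ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)/
candidates = sample.scan(optional_rest).flatten

p candidates  # ["Barack Obama", "January"]
```

Note that neither variant caps the name at three words. If that limit matters, the quantifier could be replaced with a bounded one such as {1,2}.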
Answer 2 (score: 1)
str = "By Barack Obama on January 20th 2017. By January 2017, we all know Obama will no longer be the president"
str.scan(/(?:By )((?:[A-Z][A-Za-z]+ ?+)+)/).flatten.map(&:strip)
#=> ["Barack Obama", "January"]