Question

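I'd like to use Python to extract content formatted in MediaWiki markup that follows a particular string. For example, the 2012 U.S. presidential election article contains fields called "nominee1" and "nominee2". Toy example: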

In [1]: markup = get_wikipedia_markup('United States presidential election, 2012')
In [2]: markup
Out[2]:
u"{{
| nominee1 = '''[[Barack Obama]]'''\n
| party1 = Democratic Party (United States)\n
| home_state1 = [[Illinois]]\n
| running_mate1 = '''[[Joe Biden]]'''\n
| nominee2 = [[Mitt Romney]]\n
| party2 = Republican Party (United States)\n
| home_state2 = [[Massachusetts]]\n
| running_mate2 = [[Paul Ryan]]\n
}}"

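Using the election article above as an example, I'd like to extract the information immediately after the "nomineeN" field but before the next field begins (fields are delimited by a pipe, "|"). So, given the example above, I'd ideally like to extract "Barack Obama" and "Mitt Romney", or at least the syntax they're embedded in ('''[[Barack Obama]]''' and [[Mitt Romney]]). Other regexes have extracted links from the wikimarkup, but my (failed) attempts at using a positive lookbehind assertion have been along these lines: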

nominees = re.findall(r'(?<=\|nominee\d\=)\S+',markup)

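My thinking was that it should find strings like "| nominee1 =" and "| nominee2 =", possibly with some whitespace between "|", "nominee", and "=", and then return whatever follows them, such as "Barack Obama" and "Mitt Romney".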

Answer 1

Use mwparserfromhell! It condenses your code and captures the result more reliably than a regex. For this example:

import mwparserfromhell as mw

text = get_wikipedia_markup('United States presidential election, 2012')
code = mw.parse(text)
for template in code.filter_templates():
    # template.name is a Wikicode object that usually carries trailing
    # whitespace, so compare with matches() rather than ==.
    if template.name.matches('Infobox election'):
        nominee1 = template.get('nominee1').value
        nominee2 = template.get('nominee2').value
        print(nominee1)
        print(nominee2)

Capturing the result is very simple.

Answer 2

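Lookbehinds aren't needed here. It's much easier to use a capturing group to specify exactly what should be extracted from the string. (In fact, with Python's regex engine a lookbehind can't work here, because the optional whitespace makes the expression variable-width.)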
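The fixed-width restriction can be checked directly. The following is only a small sketch: the helper name `compiles` and the constant `VARIABLE_WIDTH` are made up for illustration, and the pattern is one plausible way of adding optional whitespace to the question's lookbehind.

```python
import re

# Hypothetical variable-width lookbehind: each \s* can match any number
# of whitespace characters, so the lookbehind has no fixed length.
VARIABLE_WIDTH = r"(?<=\|\s*nominee\d\s*=\s*)\S+"


def compiles(pattern):
    """Return True if the standard-library re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(compiles(VARIABLE_WIDTH))            # rejected by re
print(compiles(r"(?<=\|nominee\d=)\S+"))   # fixed width, so accepted
```

The second pattern compiles, but it only matches when the markup contains no whitespace at all around the field name.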

Try this regex:

\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?

Result:

re.findall(r"\|\s*nominee\d+\s*=\s*(?:''')?\[\[([^]]+)\]\](?:''')?", markup)
# => ['Barack Obama', 'Mitt Romney']
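If the field number matters, or a link is piped (`[[Target|label]]`), the same capturing-group idea can be extended. This is a sketch under assumptions rather than part of the answer: `NOMINEE_RE` and `extract_nominees` are illustrative names, and the pattern keeps only the link target.

```python
import re

# Verbose variant of the pattern above, with a second group for the
# field number and support for piped links (assumed, not in the answer).
NOMINEE_RE = re.compile(
    r"""\|\s*nominee(\d+)\s*=\s*   # field name; capture its number
        (?:''')?                    # optional bold markup
        \[\[([^\]|]+)               # capture the link target
        (?:\|[^\]]*)?               # optional "|label" part, discarded
        \]\]""",
    re.VERBOSE,
)


def extract_nominees(markup):
    """Map each nominee number to the link target found in its field."""
    return {int(num): name.strip() for num, name in NOMINEE_RE.findall(markup)}


sample = (
    "{{\n"
    "| nominee1 = '''[[Barack Obama]]'''\n"
    "| party1 = Democratic Party (United States)\n"
    "| nominee2 = [[Mitt Romney]]\n"
    "}}"
)
print(extract_nominees(sample))
```

Keying the result by field number keeps each nominee tied to its matching partyN and home_stateN fields.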

Answer 3

For infobox data like this, it's best to use DBpedia. They've done all the extraction work for you :)

http://wiki.dbpedia.org/Downloads38

See the "Ontology Infobox Properties" file. You don't have to be an ontology expert here. Just use a simple TSV parser to find the information you need!

Answer 4

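First of all, you're missing a space after nominee\d. You probably want nominee\d\s*\=. Also, you really don't want to parse markup with a regex. Try one of the suggestions here instead.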
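To see the whitespace problem concretely, the question's pattern can be run next to a whitespace-tolerant one. This is only a sketch: the shortened `markup` string and the names `strict` and `tolerant` are illustrative, and the tolerant pattern uses a capturing group in place of the lookbehind.

```python
import re

# Shortened copy of the toy markup from the question.
markup = (
    "{{\n"
    "| nominee1 = '''[[Barack Obama]]'''\n"
    "| party1 = Democratic Party (United States)\n"
    "| nominee2 = [[Mitt Romney]]\n"
    "}}"
)

# The question's pattern: no whitespace is allowed anywhere, so it never
# matches markup written as "| nominee1 = ...".
strict = re.findall(r"(?<=\|nominee\d\=)\S+", markup)

# Allow optional whitespace and capture the rest of the line instead.
tolerant = re.findall(r"\|\s*nominee\d\s*=\s*([^|\n]*)", markup)

print(strict)    # []
print(tolerant)  # the two raw values, markup included
```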

If you must use a regex, why not a slightly more readable multi-line solution?

import re

markup_string = """{{
| nominee1 = '''[[Barack Obama]]'''
| party1 = Democratic Party (United States)
| home_state1 = [[Illinois]]
| running_mate1 = '''[[Joe Biden]]'''
| nominee2 = [[Mitt Romney]]
| party2 = Republican Party (United States)
| home_state2 = [[Massachusetts]]
| running_mate2 = [[Paul Ryan]]<br>
}}"""

# Group 1 ends right after the "="; [^|]* then runs up to the next pipe.
for match in re.finditer(r'(nominee\d\s*\=)[^|]*', markup_string, re.S):
    end_nominee, end_line = match.end(1), match.end(0)
    print(end_nominee, end_line)
    print(markup_string[end_nominee:end_line])
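The slices printed by that loop still carry wiki markup (bold quotes, link brackets, possibly a `<br>`). One way to reduce them to plain text is sketched below. `clean_value` is a hypothetical helper, not part of the answer, and it handles only these three constructs.

```python
import re


def clean_value(raw):
    """Strip <br> tags, [[link]] brackets and ''' bold quotes from a raw value."""
    text = re.sub(r"<br\s*/?>", "", raw)
    # [[Target]] becomes Target; [[Target|label]] becomes label.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = text.replace("'''", "")
    return text.strip()


print(clean_value(" '''[[Barack Obama]]'''\n"))
print(clean_value(" [[Mitt Romney]]\n"))
```

Nested templates, references and other markup are out of scope for a helper like this, which is part of why a real parser is the safer choice.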

Regex to extract fields from wiki template markup
