Question

I am parsing the English Wikipedia XML dump and have been working on a killer regular expression in Python. A sample snippet of the data is attached below:

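'''Amy Jean Klobuchar''' ({{IPAc-en| |k|l|oʊ|b|ə|ʃ|ɑr}}; born May 25, 1960) is the [[Seniority in the United States Senate|senior]] [[United States Senator]] from [[Minnesota]]. She is a member of the [[Minnesota Democratic-Farmer-Labor Party]], an affiliate of the [[Democratic Party (United States)|Democratic Party]]. She is the first woman to be elected as a senator for Minnesota and is one of twenty-one women serving in the [[United States Senate]].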

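She previously served as the [[county attorney]] for [[Hennepin County, Minnesota]], the most populous county in Minnesota. As an attorney, she worked with former [[Vice President of the United States|Vice President]] [[Walter Mondale]].<ref name=bio>{{cite web| author= Senate Web site| title = U.S. Senator for Minnesota Amy Klobuchar: Biography| year = 2007| url= [URL]| accessdate= 2007-02-23|archiveurl = [URL] |archivedate = February 21, 2007}}</ref> She has been called a "rising star" in the Democratic Party.<ref>{{Cite news|url=[URL]|title=Huffington Post names Klobuchar the smartest U.S. Senator|last=Tsukayama|first=Hayley|date=March 15, 2010|work=|access-date=May 14, 2017|archive-url=|archive-date=|dead-url=}}</ref><ref>{{Cite news|url=[URL]|title=As state's only senator, Klobuchar gains sympathetic attention|last=Dizikes|first=Cynthia|date=May 20, 2009|work=MinnPost|access-date=May 14, 2017|archive-url=|archive-date=|dead-url=|language=en}}</ref>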

==Early life and education==

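Born in [[Plymouth, Minnesota]], Klobuchar is the daughter of Rose Katherine (née Heuberger), who retired at age 70 from teaching second grade,<ref>{{Cite news|url=[URL]|title=Rose Klobuchar, mother of Sen. Amy Klobuchar, dies|last=Nelson|first=Tim|access-date=2017-02-22}}</ref> and [[Jim Klobuchar|James John "Jim" Klobuchar]], an author and a retired sportswriter and columnist for the ''[[Star Tribune]]''.<ref>{{Cite news|url=[URL]|title=Born to ride: Jim Klobuchar and the birth of the Minnesota bike tour|newspaper=Star Tribune|access-date=2017-02-22}}</ref> Amy has one younger sister. Jim's grandparents were [[Slovene Americans|Slovene]] immigrants, and his father was a miner on the [[Iron Range]]; Amy's maternal grandparents were from [[Switzerland]].<ref>{{cite web|url=[URL]|title=1|work=rootsweb.com|accessdate=11 September 2015}}</ref>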

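From this data, I want to parse 1) the ref tags together with the content between them, and 2) the section headings. For example, a ref tag and its content look like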

<ref name=bio> 
{{cite web
  |author=Senate Web site
  |title=U.S. Senator for Minnesota Amy Klobuchar: Biography
  |year=2007
  |url=[URL]
  |accessdate=2007-02-23
  |archiveurl=[URL]
  |archivedate = February 21, 2007}}
</ref>

and a section heading looks like

==Early life and education==

I have actually managed to parse these fields with the following code:

import re


LEXEME = [
  ('ref', re.compile(r'<ref[^/>]*>[\s\S]*?</ref>', 
    re.M | re.I)), 
  ('header', re.compile(r'(^|\n)((==[^=]+==)|(===[^=]+===)|(====[^=]+====))\s*$', 
    re.M | re.I))]

GROUP_RE = re.compile(
  '|'.join('(?P<{0}>{1})'.format(name, regex.pattern) 
    for name, regex in LEXEME), 
  re.M | re.I)


for match in GROUP_RE.finditer(content):
  print(match.lastgroup, '\t', match.group(0), '\n')

# Output
ref  <ref name=bio>{{cite web| author= Senate Web site| title = U.S. Senator for Minnesota Amy Klobuchar: Biography| year = 2007| url= [URL]| accessdate= 2007-02-23|archiveurl = [URL] |archivedate = February 21, 2007}}</ref>
ref  <ref>{{Cite news|url=[URL]|title=Huffington Post names Klobuchar the smartest U.S. Senator|last=Tsukayama|first=Hayley|date=March 15, 2010|work=|access-date=May 14, 2017|archive-url=|archive-date=|dead-url=}}</ref>
ref  <ref>{{Cite news|url=[URL]|title=As state's only senator, Klobuchar gains sympathetic attention|last=Dizikes|first=Cynthia|date=May 20, 2009|work=MinnPost|access-date=May 14, 2017|archive-url=|archive-date=|dead-url=|language=en}}</ref>
header  ==Early life and education==

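I want to extend the current regex so that, when parsing a ref tag and its content, I also capture up to 250 characters of the text before and after it. For example, I want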

She previously served as the [[county attorney]] for [[Hennepin County, Minnesota]], the most populous county in Minnesota. As an attorney, she worked with former [[Vice President of the United States|Vice President]] [[Walter Mondale]].
<ref name=bio>
{{cite web
  |author=Senate Web site
  |title=U.S. Senator for Minnesota Amy Klobuchar: Biography
  |year=2007
  |url=[URL]
  |accessdate=2007-02-23
  |archiveurl=[URL] 
  |archivedate=February 21, 2007}}
</ref>
She has been called a "rising star" in the Democratic Party.<ref>{{Cite news|url=[URL]|title=Huffington Post names Klobuchar the smartest U.S

instead of

<ref name=bio>
{{cite web
  |author=Senate Web site
  |title=U.S. Senator for Minnesota Amy Klobuchar: Biography
  |year=2007
  |url=[URL]
  |accessdate=2007-02-23
  |archiveurl=[URL] 
  |archivedate=February 21, 2007}}
</ref>

So I modified my code as follows:

LEXEME = [
  ('ref', re.compile(r'([\s\S]{1,250})(<ref[^/>]*>[\s\S]*?</ref>)([\s\S]{1,250})', 
    re.M | re.I)), 
  ('header', re.compile(r'(^|\n)((==[^=]+==)|(===[^=]+===)|(====[^=]+====))\s*$', 
    re.M | re.I))]

This causes some problems:

1) When several ref tags appear consecutively at the end of a sentence:

<ref>{{Cite
news|url=[URL]|title=Huffington
Post names Klobuchar the smartest U.S.
Senator|last=Tsukayama|first=Hayley|date=March 15,
2010|work=|access-date=May 14,
2017|archive-url=|archive-date=|dead-url=}}</ref><ref>{{Cite
news|url=[URL]|title=As
state's only senator, Klobuchar gains sympathetic
attention|last=Dizikes|first=Cynthia|date=May 20,
2009|work=MinnPost|access-date=May 14,
2017|archive-url=|archive-date=|dead-url=|language=en}}</ref>

the expected result is

ref  up-to-250-chars<ref>content</ref>up-to-250-chars
ref  up-to-250-chars<ref>content</ref>up-to-250-chars

However, the code captures only the latter ref tag and its surrounding text.

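2) When a heading appears in the following text, the ref regex captures the heading as part of that text and the header regex is skipped, as shown below.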

ref  s-date=May 14,
2017|archive-url=|archive-date=|dead-url=|language=en}}</ref>

==Early life and education== Born in [[Plymouth, Minnesota]], Klobuchar is the daughter of Rose Katherine (née Heuberger), who
retired at age 70 from teaching second grade,<ref>{{Cite
news|url=[URL]|title=Rose
Klobuchar, mother of Sen. Amy Klobuchar,
dies|last=Nelson|first=Tim|access-date=2017-02-22}}</ref> and [[Jim
Klobuchar|James John "Jim" Klobuchar]], an author and a retired
sportswriter and columnist for the ''[[Star Tribune]]''.<ref>{{Cite
news|url=[URL]

I would like to know how to solve this.

Happy coding!

Answer 1

Although this is (probably) possible with a regex that uses lookarounds, it is much easier to do with string operations:

for match in GROUP_RE.finditer(content):
    # Clamp the 250-character window to the bounds of the text.
    start = max(0, match.start() - 250)
    end = min(len(content), match.end() + 250)
    matched_text = content[start:end]
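
To make the idea above concrete, here is a minimal self-contained sketch. It assumes the regex matches only the token itself (the original patterns from the question, without the added `{1,250}` groups), and that the context is sliced out of the text afterwards. The names `TOKEN_RE`, `tokens_with_context`, `width` and the `sample` string are made up for illustration; they are not from the question.

```python
import re

# Token patterns modelled on the question's original LEXEME list
# (no context groups; the header pattern uses ^ instead of (^|\n)).
REF_RE = r'<ref[^/>]*>[\s\S]*?</ref>'
HEADER_RE = r'^(?:==[^=]+==|===[^=]+===|====[^=]+====)\s*$'

TOKEN_RE = re.compile(
    '(?P<ref>{0})|(?P<header>{1})'.format(REF_RE, HEADER_RE),
    re.M | re.I)


def tokens_with_context(content, width=250):
    """Yield (kind, before, token, after) for every ref or header match.

    Only the token is matched by the regex; the context is sliced from
    `content` afterwards, so it can overlap neighbouring tokens without
    consuming them.
    """
    for match in TOKEN_RE.finditer(content):
        start = max(0, match.start() - width)
        end = min(len(content), match.end() + width)
        yield (match.lastgroup,
               content[start:match.start()],
               match.group(0),
               content[match.end():end])


# Hypothetical miniature input: two consecutive refs, then a heading.
sample = (
    'Intro sentence.<ref>A</ref><ref>B</ref>\n'
    '\n'
    '==Early life and education==\n'
    'Born in Plymouth.<ref name=bio>C</ref> More text.\n'
)

for kind, before, token, after in tokens_with_context(sample, width=20):
    print(kind, repr(before), repr(token), repr(after))
```

Because the surrounding text is never part of the match, consecutive ref tags are each reported with their own context, and a heading that falls inside some ref's trailing context is still found as a separate `header` token.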

Writing a regular expression in Python to parse preceding and following text
