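Extracting author names in Python with a complex regular expression

Time: 2011-07-04 21:54:48

Tags: python regex

I am trying, so far without success, to build a regular expression. What I want is the content of any HTML element whose class is one of (author|byline|writer).

Here is what I have so far: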

<([A-Z][A-Z0-9]*)class=\"(byLineTag|byline|author|by)\"[^>]*>(.*?)</\1>

Examples I need to match:

  <h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>

<div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>

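Any help would be much appreciated. -Stefan

4 answers:

Answer 0: (score: 2)

Regular expressions are not particularly well suited to parsing HTML. Thankfully, there are tools built specifically for parsing HTML, such as BeautifulSoup and lxml; the latter is shown below: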

markup = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6><div class="noindex"><span class="by">By </span><span class="byline"><a href="javascript:NewWindow(575,480,'/apps/pbcs.dll/personalia?ID=sshemkus',0)" title="Email Reporter">Sarah Shemkus</a></span></div>'''

import lxml.html

doc = lxml.html.fromstring(markup)
# Select every element carrying one of the byline-style classes
for a in doc.cssselect('.author, .by, .byline, .byLineTag'):
    print(a.text_content())
# By JACK EWING and LANDON THOMAS Jr.
# By 
# Sarah Shemkus

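Answer 1: (score: 2)

Parsing HTML with regular expressions is strongly discouraged, for the reasons already mentioned. Use an existing HTML parser instead. To show how easy that is, I have included an example that uses lxml and its CSS selectors.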

from lxml import etree
from lxml.cssselect import CSSSelector

## Your html string
html_string = '''<h6 class="byline">By <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/e/jack_ewing/index.html?inline=nyt-per" title="More Articles by Jack Ewing" class="meta-per">JACK EWING</a> and <a rel="author" href="http://topics.nytimes.com/top/reference/timestopics/people/t/landon_jr_thomas/index.html?inline=nyt-per" title="More Articles by Landon Thomas Jr." class="meta-per">LANDON THOMAS Jr.</a></h6>'''

## lxml html parser
html = etree.HTML(html_string)

## lxml CSS selector
sel = CSSSelector('.author, .byline, .writer')

## Call the selector to get matches
matching_elements = sel(html)

for elem in matching_elements:
    # .text holds only the text before the element's first child tag
    print(elem.text)

Answer 2: (score: 0)

Try this:

<([A-Z][A-Z0-9]*).*?class=\"(byLineTag|byline|author|by)\"[^>]*?>(.*?)</\1>

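What I added:
  - .*?, in case the class attribute does not come directly after the tag name.
  - *?, which makes the * operator non-greedy so that it stops at the closing >.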
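As a rough sketch of how the pattern above might be driven from Python: the names BYLINE_RE, TAG_RE and extract_bylines are hypothetical, the href values are shortened to "#" for readability, and re.IGNORECASE is an added assumption, needed because the pattern lists only upper-case letters while the sample tags are lower-case.

```python
import re

# The pattern from this answer; IGNORECASE and DOTALL are assumptions added here
BYLINE_RE = re.compile(
    r'<([A-Z][A-Z0-9]*).*?class="(byLineTag|byline|author|by)"[^>]*?>(.*?)</\1>',
    re.IGNORECASE | re.DOTALL,
)
# Crude tag stripper, used only to turn the captured inner HTML into text
TAG_RE = re.compile(r'<[^>]+>')


def extract_bylines(html):
    """Return (tag name, matched class, inner text) for every match."""
    results = []
    for m in BYLINE_RE.finditer(html):
        tag, cls, inner = m.group(1), m.group(2), m.group(3)
        results.append((tag, cls, TAG_RE.sub('', inner)))
    return results


h6_sample = ('<h6 class="byline">By <a rel="author" href="#">JACK EWING</a> and '
             '<a rel="author" href="#">LANDON THOMAS Jr.</a></h6>')
div_sample = ('<div class="noindex"><span class="by">By </span>'
              '<span class="byline"><a href="#" title="Email Reporter">'
              'Sarah Shemkus</a></span></div>')

print(extract_bylines(h6_sample))
# [('h6', 'byline', 'By JACK EWING and LANDON THOMAS Jr.')]
print(extract_bylines(div_sample))
# [('div', 'by', 'By Sarah Shemkus')]
```

Note the second result: because .*? may run past the end of the opening tag, the match starts at the outer div rather than at the span that actually carries the class. That is one reason the other answers recommend a real HTML parser.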

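Answer 3: (score: 0)

You forgot the whitespace between the tag name and the first attribute name. Also, unless you are sure class is always the first attribute, your expression should allow for other attributes coming before it. If you really care about the closing tag, note that \1 is the correct back-reference there: in Python's re module groups are numbered from 1, and \0 does not refer to a group. As I pointed out in the comments, you should also include lower-case letters in the character class.

Here is a better expression (I left out the closing tag to keep it simpler):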

<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>
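A short usage sketch for this expression, under the assumption that only the matching opening tags are wanted. OPEN_TAG_RE, sample, matched_classes and first_match are hypothetical names, and the href is shortened to "#".

```python
import re

# The expression from this answer, unchanged; it matches opening tags only
OPEN_TAG_RE = re.compile(
    r"""<[A-Za-z][A-Za-z0-9]*.*? class=["'](byLineTag|byline|author|by)["'][^>]*>"""
)

sample = ('<div class="noindex"><span class="by">By </span>'
          '<span class="byline"><a href="#" title="Email Reporter">'
          'Sarah Shemkus</a></span></div>')

matches = list(OPEN_TAG_RE.finditer(sample))
matched_classes = [m.group(1) for m in matches]
first_match = matches[0].group(0)

print(matched_classes)  # ['by', 'byline']
print(first_match)      # <div class="noindex"><span class="by">
```

The first match begins at the div, not at the span, because .*? is free to cross the > of an earlier tag. Getting the element content would still require finding the matching closing tag separately.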

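Remember to join all the lines together first, to avoid errors when a tag is split across several lines. Of course, if you use an HTML parser for Python, you will probably save yourself a lot of work.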
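For comparison, here is a minimal parser-based sketch. It uses the standard library's html.parser, chosen here only because it needs no installation (the other answers use lxml). BylineExtractor, bylines_from_html, TARGET_CLASSES and VOID_TAGS are hypothetical names, and the class list is taken from the question's regex. Tags split across lines need no preprocessing.

```python
from html.parser import HTMLParser

TARGET_CLASSES = {"byLineTag", "byline", "author", "by"}
# Elements that never get a closing tag, so they must not deepen the nesting
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BylineExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting depth inside a target element, 0 = outside
        self.chunks = []    # text pieces of the element being collected
        self.results = []   # finished text, one entry per target element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if TARGET_CLASSES.intersection(classes):
            self.depth = 1
            self.chunks = []

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags carry no text and do not change nesting

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks))

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def bylines_from_html(html):
    parser = BylineExtractor()
    parser.feed(html)
    parser.close()
    return parser.results


sample = (
    '<h6\n'
    '    class="byline">By <a rel="author" href="#">JACK EWING</a> and '
    '<a rel="author" href="#">LANDON THOMAS Jr.</a></h6>\n'
    '<div class="noindex"><span class="by">By </span>'
    '<span class="byline"><a href="#" title="Email Reporter">'
    'Sarah Shemkus</a></span></div>'
)
print(bylines_from_html(sample))
# ['By JACK EWING and LANDON THOMAS Jr.', 'By ', 'Sarah Shemkus']
```

This sketch assumes reasonably well-formed markup; unclosed non-void tags inside a byline element would throw off the depth counter, which a tree-building parser such as lxml handles for you.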