Extracting content from an RSS feed using Python regex

Time: 2014-10-09 12:13:48

Tags: python regex rss extract

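I am trying to use regular expressions, specifically the re module, to extract the title, date, and description of each item in an RSS feed. So far I have the following code: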

    import re

    # html_code is assumed to already hold the raw text of the RSS feed.
    titles = re.findall(r'<title>(.*?)</title>', html_code)
    descriptions = re.findall(r'<description>(.*?)</description>', html_code)
    dates = re.findall(r'<pubDate>(.*?)</pubDate>', html_code)

    # Skip the feed-level title, which mentions 'The Guardian'.
    for title in titles:
        if 'The Guardian' not in title:
            print("Headline:", title)
            print()

    # Skip the feed-level description.
    for description in descriptions:
        if 'Latest news and features from theguardian.com' not in description:
            print("Description:", description)
            print()

    for date in dates:
        print("Date:", date)
        print()

This code produces the following output:

Headline: Tim Bresnan denies involvement in Kevin Pietersen parody Twitter account

Description: I 100% did NOT have any password, and wasnt involved&lt;br /&gt; ECB confirms Alec Stewart reported incident in 2012 &lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/2014/oct/08/kevin-pietersen-parody-twitter-account-author-denies-england-players-involved" title=""&gt; Twitter account author denies players were involved&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.theguardian.com/sport/blog/2014/oct/08/ecb-england-cricket-kevin-pietersen-tom-harrison" title=""&gt; Owen Gibson: ECB at crossroads amid fallout&lt;/a&gt;&lt;p&gt;Tim Bresnan has denied having any involvement in the controversial @KPgenius Twitter account after Kevin Pietersens autobiography claimed his former England team-mates were behind it.&lt;/p&gt;&lt;p&gt;In his book, Pietersen revealed the extent to which the account had angered and upset him, and claimed that the accounts author had told the former England wicketkeeper Alec Stewart that some of the guys in the dressing room are tweeting from it.&lt;/p&gt;&lt;p&gt;Disappointed to be implicated in the &lt;a href="https://twitter.com/hashtag/kpgenius?src=hash"&gt;#kpgenius&lt;/a&gt; account. I 100% did NOT have any password. And wasn't involved In any posting.&lt;/p&gt; &lt;a href="http://www.theguardian.com/sport/2014/oct/09/tim-bresnan-kevin-pietersen-parody-twitter"&gt;Continue reading...&lt;/a&gt;           

Date: Thu, 09 Oct 2014 11:56:43 GMT

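These results are printed for every news article. My question is: how can I clean up the description and remove all the HTML junk? I only need the basic text of each article, without the tags. How can I use regular expressions to remove them (for example the links and "&lt;/p&gt;")? Thanks.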

1 answer:

Answer 0 (score: -1)

You can use str.replace() to replace the special HTML characters with whatever you want in their place.
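As a rough sketch of what this could look like, the snippet below first decodes the escaped entities and then strips the tags with a regular expression. It assumes Python 3; the function names and the sample string are made up for illustration, and `html.unescape` is used as a standard-library alternative to chaining `str.replace()` calls by hand. Stripping tags with a regex is only a heuristic that happens to work for simple descriptions like the ones in the question.

```python
import html
import re


def unescape_with_replace(raw):
    """Decode the three entities seen in the feed using str.replace().

    Hypothetical helper: '&amp;' is handled last so that text such as
    '&amp;lt;' is not decoded twice.
    """
    return raw.replace('&lt;', '<').replace('&gt;', '>').replace('&amp;', '&')


def clean_description(raw):
    """Turn an escaped RSS description into plain text (illustrative only)."""
    # Decode entities such as &lt; and &gt; back into literal characters.
    text = html.unescape(raw)
    # Replace every tag with a space so neighbouring words do not run together.
    text = re.sub(r'<[^>]+>', ' ', text)
    # Collapse the resulting runs of whitespace.
    return re.sub(r'\s+', ' ', text).strip()


# Made-up sample in the same shape as the description in the question.
sample = ('ECB confirms report &lt;br /&gt;'
          '&lt;a href="http://example.com/a" title=""&gt;Continue reading...&lt;/a&gt;'
          '&lt;p&gt;Tim Bresnan denies it.&lt;/p&gt;')

print(clean_description(sample))
```

In the loop from the question, this would mean printing `clean_description(description)` instead of `description`. The hand-written `unescape_with_replace` only covers the three entities listed, whereas `html.unescape` also handles named and numeric entities in general.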