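Stripping content using regular expressions in Python

Time: 2014-10-11 06:14:27

Tags: python html regex rss

I am trying to extract text from an RSS feed using only the re module. So far I have used findall to extract the descriptions, but I don't know where to go from there. This is what I have written so far: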

import re
from urllib2 import urlopen

url = 'http://www.theguardian.com/sport/rss'
open_page = urlopen(url)
html_code = open_page.read()
open_page.close()

descriptions = re.findall(r'<description>(.*?)</description>',html_code)

for description in descriptions:
    if 'Latest news and features from theguardian.com' in description:
        pass
    else:
        print "Description:" ,description

This code gives the following output:

Description: Wales 0-0 Bosnia-Herzegovina&lt;p&gt;It was not &lt;a href="http://www.theguardian.com/football/2014/oct/09/wales-bosnia-chris-coleman-euro-2016-qualifier" title=""&gt;the victory that Chris Coleman, his players and the home supporters craved&lt;/a&gt; to ignite hopes of qualifying for the European Championships in France but this may well turn out to be a precious point for Wales. Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution Wayne Hennessey made in goal.&lt;/p&gt;&lt;p&gt;Unable to get into the Crystal Palace team at the moment, Hennessey produced half a dozen crucial stops here, including a triple save early in the second half and  perhaps most memorably of all  flicked Miralem Pjanics 30-yard free-kick over the bar eight minutes from time, when the Bosnia playmaker looked to have found the top corner.&lt;/p&gt; &lt;a href="http://www.theguardian.com/football/2014/oct/10/wales-bosnia-herzegovina-euro-2016-qualifying"&gt;Continue reading...&lt;/a&gt;

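I would like to know what regular expression I can use to strip out all the tags and leave just the plain text (a few sentences at most). Can anyone help?

I also understand that this would be easier with BeautifulSoup or HTMLParser, but I just want to try doing it with re.

3 Answers:

Answer 0: (score: 1)

The problem is that each description tag contains HTML code of its own.

Here you can use BeautifulSoup to find all the description tags, load each one into a separate BeautifulSoup object and get its text: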

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://www.theguardian.com/sport/rss'
soup = BeautifulSoup(urlopen(url))

for description in soup.find_all('description'):
    print BeautifulSoup(description.text).text

Prints:

Latest news and features from theguardian.com, the world's leading liberal voice
Raheem Sterling and Calum Chambers making senior mark Players dont reach their best until theyre 27 or 28 Euro 2016 qualifier match report: England 5-0 San MarinoRoy Hodgson has admitted his successor as England manager may be the chief beneficiary of the crop of young players already making their mark in the senior team as the national set-up makes plans beyond the 2016 European Championships.The squad travel to Estonia on Saturday before their latest qualifying game having established themselves at the top of Group E and with a number of bright young things seizing their opportunity to establish credentials at the higher level. The team will be tested sternly in prestigious friendly fixtures over the next two years, with Italy confirmed as opponents next March, likely to be played in Turin, and negotiations close to conclusion to play France at the Stade de France, either in November 2015 or the March before the tournament. Continue reading...
...

Answer 1: (score: 1)

Your regular expression is fine. All you need to do is remove the tags from each description. The re.sub function can help you with this:

>>> re.sub("<.*?>", "", "<h1>heading</h1>")
'heading'

Here <.*?> matches any HTML tag, and re.sub replaces each match with "" (the empty string).
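To illustrate why the lazy quantifier in <.*?> matters, here is a minimal sketch comparing it with the greedy form <.*>. It is written in Python 3 syntax (unlike the Python 2 code in this thread), and the sample string is made up for the illustration.

```python
import re

# Made-up sample containing two separate tag pairs.
sample = "<h1>heading</h1> and <b>bold</b>"

# Lazy: each match stops at the first '>' it can reach, so every tag
# is removed individually and the text in between survives.
lazy = re.sub(r"<.*?>", "", sample)

# Greedy: one match runs from the first '<' to the last '>',
# swallowing the text between the tags as well.
greedy = re.sub(r"<.*>", "", sample)

print(repr(lazy))    # 'heading and bold'
print(repr(greedy))  # ''
```

With the greedy form, the whole sample is consumed because it starts with '<' and ends with '>'.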

The code can be edited to

url = 'http://www.theguardian.com/sport/rss'
open_page = urlopen(url)
html_code = open_page.read()
open_page.close()

descriptions = re.findall(r'<description>(.*?)</description>',html_code)


for description in descriptions:
    if 'Latest news and features from theguardian.com' in description:
        pass
    else:

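        # edited here: match the entity-escaped tags, including the
        # trailing semicolons of &lt; and &gt;
        cont = re.sub(r"&lt;.*?&gt;", "", description)

        print "Description:", cont

Because the HTML inside each description is entity-escaped in the feed (< appears as &lt; and > as &gt;), the pattern has to match the escaped form: cont = re.sub(r"&lt;.*?&gt;", "", description). Note that the semicolons are part of the pattern; without them a stray ; would be left behind for every tag.

This produces the output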

    Description: Wales 0-0 Bosnia-HerzegovinaIt was not the victory that Chris Coleman, his players and the home 
supporters craved to ignite hopes of qualifying for the European Championships in France but this may well turn out to 
be a precious point for Wales. Ashley Williams and Hal Robson-Kanu will have sleepless nights about the glorious chances 
they squandered late on but at the other end of the pitch it was impossible to overlook the outstanding contribution 
Wayne Hennessey made in goal.Unable to get into the Crystal Palace team at the moment, Hennessey produced half a dozen 
crucial stops here, including a triple save early in the second half and perhaps most memorably of all flicked Miralem 
Pjanics 30-yard free-kick over the bar eight minutes from time, when the Bosnia playmaker looked to have found the top 
corner. Continue reading...
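As a variation on the same idea, the sketch below decodes the entities first and then strips ordinary tags, and also trims the result to a couple of sentences, as the question asks. It is only a sketch under some assumptions: it uses Python 3 (html.unescape is not available in the Python 2 used in this thread), it runs on a shortened, hard-coded sample instead of the live feed, and the helper names strip_tags and first_sentences are hypothetical. The sentence splitting is deliberately naive.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")

def strip_tags(escaped):
    """Decode entities such as &lt; and &gt;, then drop anything tag-like."""
    decoded = html.unescape(escaped)
    return TAG_RE.sub("", decoded)

def first_sentences(text, limit=2):
    """Keep at most `limit` sentences, splitting naively after . ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:limit])

# Shortened, made-up sample in the same escaped form as the feed output.
sample = (
    "Wales 0-0 Bosnia-Herzegovina&lt;p&gt;It was a precious point for Wales. "
    "Hennessey made six stops. "
    "&lt;a href=\"http://example.com\"&gt;Continue reading...&lt;/a&gt;&lt;/p&gt;"
)

plain = strip_tags(sample)
print(first_sentences(plain))
```

Because the tags are replaced with nothing, text on either side of a tag is joined directly ("Bosnia-HerzegovinaIt"), just as in the output above; substituting a space instead would keep the words apart.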

Answer 2: (score: 0)

<[^>]*>

Try this. You can use re.sub and replace each match with an empty string. See the demo.

http://regex101.com/r/vR4fY4/9
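One practical difference between <[^>]*> and <.*?> is how they treat a tag that spans a line break: by default . does not match a newline, whereas the negated class [^>] does. A small sketch in Python 3 syntax, using a made-up sample string:

```python
import re

# Made-up sample: the opening tag is split across two lines.
sample = '<a href="http://example.com"\n   title="">link text</a>'

# The negated class matches newlines, so both tags are removed.
negated = re.sub(r"<[^>]*>", "", sample)

# Without re.DOTALL, '.' stops at the newline and the opening tag survives.
dot_lazy = re.sub(r"<.*?>", "", sample)

# With re.DOTALL, the lazy dot behaves like the negated class here.
dot_all = re.sub(r"<.*?>", "", sample, flags=re.DOTALL)

print(repr(negated))   # 'link text'
print(repr(dot_lazy))
print(repr(dot_all))   # 'link text'
```

For the entity-escaped text in the question, this pattern only applies after the entities have been decoded, since the raw descriptions contain &lt; and &gt; rather than literal angle brackets.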