Question

I'm trying to scrape all the titles from this RSS feed:

http://www.quora.com/Python-programming-language-1/rss

Here is my code for this:

import urllib2
import re
content = urllib2.urlopen('http://www.quora.com/Python-programming-language-1/rss').read()
allTitles =  re.compile('<title>(.*)</title>')
list = re.findall(allTitles,content)
for e in range(0, 2):
    print list[e]

However, instead of a list of titles, the output is a large chunk of the RSS source. What am I doing wrong?

Answer 1

You should use the non-greedy marker (?) in the expression:

#allTitles =  re.compile('<title>(.*)</title>')
allTitles =  re.compile('<title>(.*?)</title>')

Without the ?, everything except the last </title> ends up in the (.*) group...
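To see the difference without hitting the network, here is a minimal sketch on a made-up one-line fragment. The `sample` string is an assumption standing in for the downloaded feed text, and the snippet is written for Python 3 rather than the Python 2 used in the question.

```python
import re

# Hypothetical one-line fragment standing in for the downloaded RSS text.
sample = "<title>First</title><item><title>Second</title></item>"

# Greedy: .* runs as far as it can, so the match ends at the LAST </title>.
greedy = re.findall(r"<title>(.*)</title>", sample)

# Non-greedy: .*? stops at the FIRST </title> it can reach, one match per tag.
non_greedy = re.findall(r"<title>(.*?)</title>", sample)

print(greedy)      # ['First</title><item><title>Second']
print(non_greedy)  # ['First', 'Second']
```

Note that `.` does not match newlines by default, so the greedy pattern only swallows this much when several title tags sit on the same line, which is presumably what happens with this feed.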

Answer 2

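As already noted, your regexp is missing the non-greedy specifier, and adding it fixes the code. But I strongly recommend switching from regular expressions to tools better suited to XML parsing, such as lxml or BeautifulSoup, or to a dedicated RSS-parsing module such as feedparser.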

For example, here is how to do the task with lxml:

>>> import lxml.etree
>>> rss = lxml.etree.fromstring(content)
>>> titles = rss.findall('.//title')
>>> print '\n'.join(title.text for title in titles[:2])
Questions About Python (programming language) on Quora
Could someone explain for me the following Python function that uses @wraps from functools?
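If installing lxml is not an option, roughly the same approach should work with the standard library's xml.etree.ElementTree. The following is only a sketch under some assumptions: the `sample_rss` string is a made-up miniature feed rather than the real Quora one, and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Made-up miniature feed; in practice this would be the downloaded content.
sample_rss = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First question</title></item>
    <item><title>Second question</title></item>
  </channel>
</rss>"""

# Parse the XML and collect every <title> element at any depth,
# in document order (the channel title comes first).
root = ET.fromstring(sample_rss)
titles = [el.text for el in root.findall(".//title")]

# Print the first two titles, mirroring the lxml session above.
print("\n".join(titles[:2]))
```

Because the parser understands the document structure, the line layout of the feed no longer matters, which is the main reason to prefer this over a regexp.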

Web scraping with urllib2
