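Problem scraping data with BeautifulSoup

Asked: 2010-07-01 13:53:00

Tags: python loops beautifulsoup web-scraping

I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.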

import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

for number in xrange(1,10):   
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title

However, whenever I run it I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

I have narrowed it down to BeautifulSoup being unable to read the fourth document in the loop. Can anyone explain what I am doing wrong?

Kind regards

Thomas

3 answers:

Answer 0: (score: 4)

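BeautifulSoup works in Unicode, so it is not responsible for this encoding error. More likely, your problem comes from the print statement: your standard output seems to be in ascii (i.e. sys.stdout.encoding is 'ascii' or not set), so you would indeed get this kind of error when trying to print a string that contains non-ascii characters.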
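The failure can be reproduced without BeautifulSoup or the network. The following is a minimal sketch, assuming only that the offending character is the en dash (U+2013) named in the traceback; the variable names are illustrative and not part of the original script.

```python
# -*- coding: utf-8 -*-
# Title text of report A7-0004/2010; u"\u2013" is the en dash from the traceback.
title = u"REPORT on equality between women and men in the European Union \u2013 2009"

try:
    # Roughly what printing to an ASCII-only stdout has to do.
    title.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ascii failed: %s" % exc.reason)

# UTF-8 can represent the en dash, so this encode succeeds.
utf8_bytes = title.encode("utf-8")
print(len(utf8_bytes))
```

The same text encodes cleanly to UTF-8, which suggests the problem lies in the output encoding rather than in the parsed document.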

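What is your OS? How is your console (a.k.a. terminal) set up (e.g. if on Windows, which "code page")? Did you set PYTHONIOENCODING in the environment to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?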
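One way to answer these questions is to have Python report what it sees. This is a small sketch rather than a definitive diagnostic; the variable names are made up for illustration.

```python
import os
import sys

# May be None (e.g. under Python 2 when output is piped or redirected).
stdout_encoding = getattr(sys.stdout, "encoding", None)

# None unless the variable was exported before starting Python,
# e.g.  PYTHONIOENCODING=utf-8 python ebs.py
io_override = os.environ.get("PYTHONIOENCODING")

print("sys.stdout.encoding: %r" % (stdout_encoding,))
print("PYTHONIOENCODING: %r" % (io_override,))
```

If the first line reports 'ascii', 'ANSI_X3.4-1968' or None, printing a title that contains a non-ascii character is likely to fail in the way shown in the question.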

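On my Mac, where the encoding is detected correctly, running your code (apart from also printing the number with each title, for clarity) works fine and shows: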

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$ 

Answer 1: (score: 1)

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I am not entirely sure why print <somelist> sometimes works and sometimes does not, though.

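Answer 2: (score: 0)

If you want to print the titles to a file, you need to specify an encoding that can represent non-ascii characters; utf8 should work fine. To do that, you need to add: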

import codecs

out = codecs.open('titles.txt', 'w', 'utf8')

at the top of your script

and print to the file:

print >> out, title
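For a self-contained illustration of the same idea, the sketch below writes two sample titles to a UTF-8 file and reads them back. It swaps in io.open (available in Python 2.6+ and Python 3) for the codecs.open call above, uses a hard-coded list as a stand-in for the scraped titles, and writes to a temporary directory; those choices are assumptions made for the example, not part of the answer.

```python
import io
import os
import tempfile

# Stand-in for the scraped titles; the second one contains the en dash U+2013.
titles = [
    u"REPORT Report on promoting good governance in tax matters - A7-0007/2010",
    u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010",
]

# Hypothetical location; the answer writes 'titles.txt' to the current directory.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# The file object encodes text to UTF-8 on write, so no ascii step is involved.
with io.open(path, "w", encoding="utf-8") as out:
    for t in titles:
        out.write(t + u"\n")

# Read the file back to confirm the en dash survived the round trip.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read().splitlines()
```

Writing one title per line, instead of printing the whole list, also keeps the file free of the list's brackets and tag markup.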