I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.
import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

# Fetch reports A7-0001/2010 through A7-0009/2010 and print each <title>
for number in xrange(1, 10):
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title
However, whenever I run it, I get the following error:
Traceback (most recent call last):
File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)
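I have narrowed it down to BeautifulSoup being unable to read the fourth document in the loop. Can anyone explain to me what I am doing wrong?
Kind regards,
Thomas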
Answer 0 (score: 4)
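BeautifulSoup works in Unicode, so it is not responsible for that encoding error. More likely, your problem comes from the print statement: your standard output appears to be in ascii (i.e. sys.stdout.encoding is 'ascii' or not set), so you will indeed get this kind of error when you try to print a string that contains non-ascii characters. Here the offending character is u'\u2013', the en dash in the fourth title.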
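What is your OS? How is your console, AKA terminal, set up (e.g., if on Windows, which "codepage")? Have you set PYTHONIOENCODING in your environment to control sys.stdout.encoding, or are you relying on the encoding being picked up automatically?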
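To help answer those questions, here is a minimal sketch for inspecting the relevant settings. The helper name can_encode is hypothetical and only for illustration; the sketch simply checks whether the en dash from the traceback can be represented in a given encoding, which is the operation that fails when stdout is ascii.

```python
import os
import sys

# Encoding Python picked for standard output (can be None when output is piped).
print("stdout encoding: %s" % sys.stdout.encoding)
# Value of the override variable, if any was set in the environment.
print("PYTHONIOENCODING: %s" % os.environ.get("PYTHONIOENCODING"))

en_dash = u"\u2013"  # the character named in the traceback


def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("ascii can encode en dash: %s" % can_encode(en_dash, "ascii"))
print("utf-8 can encode en dash: %s" % can_encode(en_dash, "utf-8"))
```

If the first line reports ascii or None, one thing to try is running the script with the variable set, e.g. PYTHONIOENCODING=utf-8 python ebs.py, assuming your terminal can actually display UTF-8.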
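On my Mac, where the encoding is detected correctly, running your code (except that, for clarity, I also print the number along with each title) works fine and shows: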
$ python ebs.py
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$
Answer 1 (score: 1)
Replacing
print title
with
for t in title:
    print(t)
or
print('\n'.join(t.string for t in title))
works. I'm not entirely sure why print <somelist> sometimes works and sometimes doesn't.
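One general Python behaviour that may be relevant here (this is an assumption; the cause in the BeautifulSoup case is not verified): printing a list formats each element with repr(), while printing a single element uses str(). The class Title below is a hypothetical stand-in, not a BeautifulSoup class, and only makes the two code paths visible.

```python
class Title(object):
    """Hypothetical stand-in for a parsed tag; not a BeautifulSoup class."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        # Used when the object itself is printed.
        return "str:" + self.text

    def __repr__(self):
        # Used for each element when a containing list is printed.
        return "repr:" + self.text


titles = [Title("A7-0004/2010")]

as_list = str(titles)     # the text print produces for the whole list
as_item = str(titles[0])  # the text print produces for one element

print(as_list)
print(as_item)
```

So the two forms can go through different conversion code, and whether either one trips over a non-ascii character depends on how the element's class implements those methods.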
Answer 2 (score: 0)
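If you want to print the titles to a file, you need to specify an encoding that can represent non-ascii characters; utf8 should work fine. To do that, add:
import codecs
out = codecs.open('titles.txt', 'w', 'utf8')
at the top of your script, and print to the file with:
print >> out, title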
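As a variation on the same idea, here is a small sketch that uses io.open instead of codecs.open (a swap on my part, since io.open accepts an explicit encoding on both Python 2.6+ and Python 3). The function name write_titles and the sample data are hypothetical, and the file goes into a temporary directory so the example is safe to run anywhere.

```python
import io
import os
import tempfile


def write_titles(path, titles):
    """Write each title on its own line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for title in titles:
            out.write(title + u"\n")


# Hypothetical sample data; \u2013 is the en dash from the traceback.
sample = [u"REPORT on equality between women and men in the European Union \u2013 2009"]

path = os.path.join(tempfile.mkdtemp(), "titles.txt")
write_titles(path, sample)

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, encoding="utf-8") as check:
    round_trip = check.read()

print(repr(round_trip))
```

In the original script you would pass the text of each tag rather than the tag list itself, for example the t.string values used in the previous answer.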