Question

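I'm Nicola, a new Python user with no real background in computer programming, so I really need some help with a problem I've run into. I wrote some code to scrape data from this web page: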

http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02

Basically, the goal of my code is to scrape the data from all the tables on the page and write it to a txt file. Here is my code:

#!/usr/bin/env python


from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)


outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

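Everything would work fine, but the first column of some tables on that page contains words with accented characters. When I run the code, I get the following: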

Traceback (most recent call last):
File "modena2.py", line 158, in <module>
  extract(soup1)
File "modena2.py", line 98, in extract
  print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)

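I know the problem is the encoding of the accented characters. I have tried to find a solution, but it is really beyond my knowledge. I want to thank in advance everyone who helps me; I really appreciate it! Sorry if this question is too basic, but as I said, I am just getting started with Python and I am learning everything on my own.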

Thanks! Nicola

Answer 1

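I'm going to try again based on feedback. Since you are using a print statement to produce the output, your output must be bytes rather than characters (that is the reality of current operating systems). By default, Python's sys.stdout (which the print statement writes to) uses the 'ascii' character encoding. Because ASCII defines only the byte values 0 to 127, those are the only values you can print. Hence the error for the value '\xe0'.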
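
To make the failure concrete, here is a minimal sketch that encodes a row-like string with several codecs. It is written to run on both Python 2.7 and Python 3. The record values and the try_encode helper are made up for illustration and are not part of the original script.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample row; u'\xe0' is the accented character from the traceback.
record = (u"Citt\xe0", u"1.000", u"900", u"100")
line = u"|".join(record)


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_bytes = try_encode(line, "ascii")         # None: u'\xe0' is outside 0-127
utf8_bytes = try_encode(line, "utf-8")          # u'\xe0' becomes two bytes
latin1_bytes = try_encode(line, "iso-8859-1")   # u'\xe0' becomes one byte
```

The ASCII attempt is the same conversion Python 2 performs implicitly when a unicode string is printed to a stream that has no encoding set, which is why the error message names the 'ascii' codec.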

You can change the character encoding of sys.stdout to UTF-8 like this:

import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u'|'.join([u'abc', u'\u0100'])

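The print statement above does not complain about printing a Unicode string that cannot be represented in ASCII. However, the following code, which prints bytes rather than characters, raises a UnicodeDecodeError, so be careful: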

import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print '|'.join(['abc', '\xe0'])

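You may find that your code is trying to print characters, and that setting the character encoding of sys.stdout to UTF-8 (or ISO-8859-1) fixes it. But you may instead find that your code is trying to print bytes (obtained from the BeautifulSoup API), in which case the fix could look like this: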

import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print '|'.join(['abc', '\xe0']).decode('ISO-8859-1')

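I'm not familiar with the BeautifulSoup package, but I suggest testing it with a variety of documents to see whether it detects the character encoding correctly. Your code does not supply an encoding explicitly, so BeautifulSoup is evidently deciding on one by itself. If that decision comes from the meta encoding tag, then great.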
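
One way to check whether an encoding guess is right is to decode a few raw bytes and see what comes back. The sketch below is an illustration only: the sample bytes are invented, and it assumes the page is served as ISO-8859-1, which the original post does not confirm. It runs on Python 2.7 and Python 3.

```python
# Invented sample: bytes as they might arrive from an ISO-8859-1 page.
raw = b"abc|\xe0"

# Decoding with the right codec turns the byte 0xE0 into the character u'\xe0'.
text = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails, because a lone 0xE0 byte
# is an incomplete multi-byte sequence in UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Once the text is Unicode, it can be re-encoded for output in any codec.
reencoded = text.encode("utf-8")
```

A wrong guess does not always raise an error; it can also silently produce the wrong characters, so comparing the decoded text against the page as shown in a browser is a useful extra check.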

Answer 2

Edit: I just tried it, and since I assume you want a table in the end, here is a solution that writes a CSV file.

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv


def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
            col = row.findAll('td')
            voce = col[0].string
            accertamento = col[1].string
            competenza = col[2].string
            residui = col[3].string
            record = (voce, accertamento, competenza, residui)
            outfile.writerow([s.encode('utf8') if type(s) is unicode else s for s in record])

    # swap print for outfile statement in all other blocks as well
    # ... 

outfile = csv.writer(open(r'modena_quadro02.csv','wb'))
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
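
For comparison, here is a hedged sketch of the same idea on Python 3, where the csv module works with text and the encoding is given when the file is opened, so no per-cell encode() call is needed. The sample row, the temporary file path and the "|" delimiter are assumptions chosen for illustration; the code above targets Python 2 and is not replaced by this.

```python
import csv
import io
import os
import tempfile

# Made-up sample row standing in for one scraped table row.
rows = [(u"Citt\xe0", u"1.000", u"900", u"100")]

# Write to a throwaway location so the sketch has no side effects.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")

# newline="" lets the csv module control line endings itself.
with io.open(path, "w", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle, delimiter="|")
    writer.writerows(rows)

# Read the file back as text to confirm the accented character survived.
with io.open(path, encoding="utf-8") as handle:
    content = handle.read()
```

Using a with block also closes the file automatically, which the Python 2 version above leaves to the interpreter.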

Answer 3

I ran into a similar problem last week. It was easy to fix in my IDE (PyCharm).

Here is my fix:

Starting from the PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8, and it now works like a charm.

Hope this helps!

Answer 4

The problem is printing Unicode text to a binary file:

>>> print >>open('e0.txt', 'wb'), u'\xe0'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)

To fix it, either encode the Unicode text to bytes (u'\xe0'.encode('utf-8')) or open the file in text mode:

#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('e0.utf8.txt', 'w', encoding='utf-8') as file:
    print(u'\xe0', file=file)
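
The first option, encoding to bytes yourself and keeping the file in binary mode, could look like the sketch below. The temporary path is an assumption made so the example is self-contained; it runs on Python 2.7 and Python 3.

```python
import os
import tempfile

# Throwaway output location (hypothetical; any writable path would do).
path = os.path.join(tempfile.mkdtemp(), "e0.bytes.txt")

# Encode explicitly, then write the resulting bytes to a binary-mode file.
with open(path, "wb") as handle:
    handle.write(u"\xe0\n".encode("utf-8"))

# Read the raw bytes back: u'\xe0' is stored as the two bytes C3 A0.
with open(path, "rb") as handle:
    data = handle.read()
```

Both approaches produce the same bytes on disk. The text-mode version keeps the encoding decision in one place, which tends to be easier to maintain when many writes are involved.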

Answer 5

Try changing this line:

html1 = page1.read()

to this:

html1 = page1.read().decode(encoding)

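where encoding is, for example, 'UTF-8', 'ISO-8859-1', etc. I'm not familiar with the mechanize package, but hopefully there is a way to discover the encoding of the document returned by the read() method. It seems that read() gives you a byte string rather than a character string, so the later join call has to assume ASCII as the encoding.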
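
If the server declares a charset in its Content-Type header, one possible way to pick the encoding is to parse that header with the standard library. This is only a sketch: the header value and the sample bytes are hard-coded assumptions, charset_from_content_type is a hypothetical helper, and how to read the header from a mechanize response is not covered here.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` when the header declares no charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# Assumed header value and body bytes, for illustration only.
raw = b"Citt\xe0"
encoding = charset_from_content_type("text/html; charset=ISO-8859-1")
text = raw.decode(encoding)
```

When the header carries no charset, the fallback is a guess, so it is worth verifying the decoded text against the page itself.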

Python - Problem with accented characters when scraping data from a website
