Question

I'm writing a Python program that scrapes a Wikipedia table. Everything works, except that some characters don't seem to be encoded correctly by Python.

Here is the code:

import csv
import requests
from BeautifulSoup import BeautifulSoup
import sys

reload(sys)
sys.setdefaultencoding( "utf-8" )

url = 'https://en.wikipedia.org/wiki/List_of_airports_by_IATA_code:_A'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'wikitable sortable'})

list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        text = cell.text.replace('&nbsp;', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

outfile = open("./scrapedata.csv", "wb")
writer = csv.writer(outfile)
print list_of_rows
writer.writerows(list_of_rows)

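For example, Merzbrück comes out as MerzbrÃ¼ck. The problem seems to be more or less limited to accented characters (é, è, ç, à, etc.). Is there a way to avoid this? Thanks in advance for your help.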

Answer 1

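This is certainly an encoding problem. The question is where. My suggestion is to go through each step and look at the raw data, to see whether you can pinpoint exactly where the encoding goes wrong.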

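So, for example, print response.content to check whether the characters are correct in the requests object. If they are, move on and look at soup.prettify() to check whether the BeautifulSoup object is fine, then list_of_rows, and so on.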
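To make the "inspect each step" advice concrete, here is a minimal Python 3 sketch (not the asker's Python 2 code) that reproduces the exact garbling from the question. It assumes the cause is UTF-8 bytes being read back as Latin-1 at some stage, which is the usual source of the "Ã¼" pattern but has not been confirmed for this particular script.

```python
# Python 3 sketch: reproduce the garbling seen in the question.
# Assumption: UTF-8 bytes are being interpreted as Latin-1 somewhere.
original = "Merzbrück"

# UTF-8 stores "ü" as two bytes: 0xC3 0xBC.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as Latin-1 turns each byte into its own character.
garbled = utf8_bytes.decode("latin-1")

# Undoing the wrong step recovers the text, which supports the diagnosis.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(garbled)
print(repaired)
```

If the bytes printed at one step still contain \xc3\xbc but the text at the next step shows two separate characters, the wrong decoding happened between those two steps.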

All that said, I suspect the problem is in writing the csv. The csv documentation has an example of how to write unicode to a csv. This answer may also help you.
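As a rough illustration of writing unicode with the csv module, here is a Python 3 sketch; under Python 2 the approach from the csv documentation would be needed instead. The helper names write_rows_utf8 and read_rows_utf8, the sample rows, and the temporary file path are made up for this example.

```python
import csv
import os
import tempfile


def write_rows_utf8(path, rows, encoding="utf-8"):
    """Write rows to a CSV file in text mode with an explicit encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as outfile:
        csv.writer(outfile).writerows(rows)


def read_rows_utf8(path, encoding="utf-8"):
    """Read the rows back using the same encoding."""
    with open(path, "r", encoding=encoding, newline="") as infile:
        return list(csv.reader(infile))


# Hypothetical sample data standing in for list_of_rows.
sample_rows = [["Merzbrück", "é è ç à"], ["plain", "ascii"]]

# Write to a throwaway location and read it back to check the round trip.
sample_path = os.path.join(tempfile.mkdtemp(), "scrapedata.csv")
write_rows_utf8(sample_path, sample_rows)
round_trip = read_rows_utf8(sample_path)
print(round_trip)
```

If the file looks fine when read back like this but still appears garbled in a spreadsheet program, the viewer may be guessing the wrong encoding; passing encoding="utf-8-sig" adds a byte order mark that some programs use to detect UTF-8.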

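For what it's worth, I was able to write the correct characters to a csv using the pandas library (I'm using Python 3, so your experience or syntax may differ slightly, since it looks like you're using Python 2):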

import pandas as pd

df = pd.DataFrame(list_of_rows)
df.to_csv('scrapedata.csv', encoding='utf-8')

Python: character encoding problem
