Question

I am trying to extract the World Cup group table data. Here is my code:

from bs4 import BeautifulSoup
import requests
url ="http://www.uefa.com/worldcup/season=2014/standings/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
table = soup.find('table')
rows = table.findAll('tr')
data = [[td.text.encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.encode("utf-8") for th in tr.findAll("th")] for tr in rows]
print head
for i in data:
    print str(i)

Everything works, but I get some strange characters in the output:

[['', 'Teams', 'P', 'W', 'D', 'L', 'F', 'A', '+/-', 'Pts'], [], [], [], []]
[]
['0', '\xc2\xa0Brazil', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Croatia', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Mexico', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Cameroon', '0', '0', '0', '0', '0', '0', '0', '0']

How can I fix this?

Answer 1

Strip the text instead of encoding it:

data = [[td.text.strip() for td in tr.findAll("td")] for tr in rows]

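To get plain str objects instead of unicode (in Python 2 this works only while the text is pure ASCII; otherwise str() raises UnicodeEncodeError):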

data = [[str(td.text.strip()) for td in tr.findAll("td")] for tr in rows]
head = [[str(th.text) for th in tr.findAll("th")] for tr in rows]
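Note that strip() only removes whitespace at the start and end of a string. As a minimal sketch in Python 3, assuming a cell might also contain a no-break space in the middle of the text, a small helper could normalise those as well. The name clean_cell and the sample row are hypothetical and are not taken from the page being scraped:

```python
def clean_cell(text):
    """Hypothetical helper: normalise no-break spaces, then trim the cell."""
    # Replace U+00A0 anywhere in the text with a regular space,
    # then strip leading and trailing whitespace.
    return text.replace("\u00a0", " ").strip()


# Made-up sample values standing in for td.text results.
row = ["0", "\u00a0Brazil", "\u00a0Costa\u00a0Rica"]
cleaned = [clean_cell(cell) for cell in row]
print(cleaned)
```

In the comprehension above, clean_cell(td.text) would take the place of td.text.strip().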

Answer 2

Your text results contain U+00A0 NO-BREAK SPACE characters, which are encoded in UTF-8 as the bytes C2 A0.

If you want to remove them, strip the text before encoding it:

data = [[td.text.strip().encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.strip().encode("utf-8") for th in tr.findAll("th")] for tr in rows]
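The order of the two calls matters. The following Python 3 sketch uses a made-up list of cell texts (not fetched from the site) to compare encoding first with stripping first:

```python
# Hypothetical cell texts standing in for td.text values from the page.
cells = ["0", "\u00a0Brazil", "0"]

# Encoding without stripping keeps the no-break space as the bytes C2 A0.
encoded_raw = [c.encode("utf-8") for c in cells]

# Stripping first removes it, so only the visible text gets encoded.
encoded_clean = [c.strip().encode("utf-8") for c in cells]

print(encoded_raw)
print(encoded_clean)
```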

A no-break space counts as whitespace, so the strip() method of a unicode string removes it just as it removes regular spaces:

>>> '\xc2\xa0Cameroon'.decode('utf8')
u'\xa0Cameroon'
>>> '\xc2\xa0Cameroon'.decode('utf8').strip()
u'Cameroon'
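For readers on Python 3, a rough equivalent of the session above is sketched below. It assumes the data arrives as UTF-8 bytes, and it also shows that stripping the raw bytes does not help, because bytes.strip() only removes ASCII whitespace:

```python
raw = b"\xc2\xa0Cameroon"

# Decode to text first; str.strip() treats U+00A0 as whitespace.
text = raw.decode("utf-8")
stripped = text.strip()

# bytes.strip() only knows ASCII whitespace, so the C2 A0 bytes survive.
bytes_stripped = raw.strip()

print(repr(text))
print(repr(stripped))
print(repr(bytes_stripped))
```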
