How do I avoid unwanted characters in extracted table data?

Time: 2014-06-02 13:36:34

Tags: python beautifulsoup

I am trying to extract the World Cup group standings table. Here is my code:

from bs4 import BeautifulSoup
import requests
url = "http://www.uefa.com/worldcup/season=2014/standings/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
table = soup.find('table')
rows = table.findAll('tr')
data = [[td.text.encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.encode("utf-8") for th in tr.findAll("th")] for tr in rows]
print head
for i in data:
    print str(i)

Everything works, but I get some strange characters in the output:

[['', 'Teams', 'P', 'W', 'D', 'L', 'F', 'A', '+/-', 'Pts'], [], [], [], []]
[]
['0', '\xc2\xa0Brazil', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Croatia', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Mexico', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Cameroon', '0', '0', '0', '0', '0', '0', '0', '0']

How can I fix this?

2 answers:

Answer 0: (score: 2)

Strip each cell's text instead of encoding it:

data = [[td.text.strip() for td in tr.findAll("td")] for tr in rows]

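To get plain str values instead of unicode (in Python 2 this only works once the remaining text is pure ASCII):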

data = [[str(td.text.strip()) for td in tr.findAll("td")] for tr in rows]
head = [[str(th.text) for th in tr.findAll("th")] for tr in rows]

Answer 1: (score: 2)

Your text results contain U+00A0 NO-BREAK SPACE characters, which are encoded in UTF-8 as the bytes C2 A0.
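To check this yourself, here is a minimal sketch. It is written for Python 3 (the code in the question is Python 2), and the variable names are only illustrative.

```python
import unicodedata

nbsp = "\u00a0"  # the character that precedes each team name

# Official Unicode name of the code point
name = unicodedata.name(nbsp)

# UTF-8 encoding of the character: the two bytes 0xC2 0xA0
encoded = nbsp.encode("utf-8")

print(name)
print(encoded)
```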

If you want to remove them, strip the text before encoding it:

data = [[td.text.strip().encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.strip().encode("utf-8") for th in tr.findAll("th")] for tr in rows]
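The same strip-then-encode order can be tried without fetching the page. This is a rough Python 3 sketch, not the code from the answer: `clean_cells` is a hypothetical helper, and `raw_row` is made-up sample data shaped like the output in the question.

```python
def clean_cells(cells):
    """Strip leading/trailing whitespace, including U+00A0, from each cell."""
    return [cell.strip() for cell in cells]


# Hypothetical row: text as it comes out of td.text, before any encoding
raw_row = ["0", "\u00a0Brazil", "0"]

cleaned = clean_cells(raw_row)

# Encode only after stripping, as in the list comprehensions above
encoded_row = [cell.encode("utf-8") for cell in cleaned]

print(cleaned)
print(encoded_row)
```

Keeping the stripping in a small helper makes it easy to reuse for both the td and th cells.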

A no-break space counts as whitespace, so calling strip() on the decoded (unicode) text removes it just as it removes regular spaces:

>>> '\xc2\xa0Cameroon'.decode('utf8')
u'\xa0Cameroon'
>>> '\xc2\xa0Cameroon'.decode('utf8').strip()
u'Cameroon'
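For readers on Python 3, where text and bytes are separate types, a roughly equivalent session could look like the sketch below. It also shows why the order matters: stripping the still-encoded bytes leaves the C2 A0 prefix in place, because bytes.strip() only removes ASCII whitespace. The variable names are illustrative.

```python
raw = b"\xc2\xa0Cameroon"   # the UTF-8 bytes seen in the question's output

text = raw.decode("utf-8")  # text with a leading U+00A0
stripped = text.strip()     # str.strip() treats U+00A0 as whitespace

# Stripping the undecoded bytes does not help
still_raw = raw.strip()

print(repr(text))
print(repr(stripped))
print(repr(still_raw))
```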