I'm trying to extract the World Cup group table data. Here is my code:
from bs4 import BeautifulSoup
import requests
url ="http://www.uefa.com/worldcup/season=2014/standings/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
table = soup.find('table')
rows = table.findAll('tr')
data = [[td.text.encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.encode("utf-8") for th in tr.findAll("th")] for tr in rows]
print head
for i in data:
    print str(i)
Everything works, but I get some strange characters in the output:
[['', 'Teams', 'P', 'W', 'D', 'L', 'F', 'A', '+/-', 'Pts'], [], [], [], []]
[]
['0', '\xc2\xa0Brazil', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Croatia', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Mexico', '0', '0', '0', '0', '0', '0', '0', '0']
['0', '\xc2\xa0Cameroon', '0', '0', '0', '0', '0', '0', '0', '0']
How can I fix this?
Answer 0 (score: 2)
Use:
data = [[td.text.strip() for td in tr.findAll("td")] for tr in rows]
To get plain strings instead:
data = [[str(td.text.strip()) for td in tr.findAll("td")] for tr in rows]
head = [[str(th.text) for th in tr.findAll("th")] for tr in rows]
Answer 1 (score: 2)
Your text results contain U+00A0 NO-BREAK SPACE characters, which are encoded to UTF-8 as the bytes C2 A0.
If you want to remove them, strip the text before encoding:
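As a quick check of the claim above, the following minimal sketch looks up the character's Unicode name and its UTF-8 encoding. The variable name `nbsp` is only illustrative, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
import unicodedata

# U+00A0 written as a unicode literal (the u prefix is accepted by Python 2 and 3.3+).
nbsp = u"\u00a0"

# Look up the official Unicode name of the character.
print(unicodedata.name(nbsp))

# Encoding to UTF-8 yields the two bytes C2 A0 seen in the question's output.
encoded = nbsp.encode("utf-8")
print(encoded == b"\xc2\xa0")
```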
data = [[td.text.strip().encode("utf-8") for td in tr.findAll("td")] for tr in rows]
head = [[th.text.strip().encode("utf-8") for th in tr.findAll("th")] for tr in rows]
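A no-break space counts as whitespace, so the strip() method (called here on the decoded unicode string) removes it just as it removes regular spaces: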
>>> '\xc2\xa0Cameroon'.decode('utf8')
u'\xa0Cameroon'
>>> '\xc2\xa0Cameroon'.decode('utf8').strip()
u'Cameroon'
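The session above is Python 2. For comparison, here is a rough Python 3 sketch of the same idea, assuming the cell text is already a `str` (which is what `td.text` returns there). The helper `clean_cells` is a hypothetical name introduced for illustration, not part of BeautifulSoup.

```python
def clean_cells(cells):
    """Strip leading/trailing whitespace, including U+00A0, from each cell."""
    return [cell.strip() for cell in cells]


# Simulate one scraped cell: the UTF-8 bytes from the question, decoded to text.
raw = b"\xc2\xa0Cameroon".decode("utf-8")
row = ["0", raw, "0"]

print(clean_cells(row))  # ['0', 'Cameroon', '0']
```

In Python 3 there is usually no need to call .encode("utf-8") before printing, so stripping alone is likely enough for this table.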