I'm using BS4 in Python to parse an HTML table. Everything works: I can identify all the elements I need and print them out. But the program stops working when I try to write the results to a text file. I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 13: ordinal not in range(128)
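For context, the failure can be reproduced in isolation: the table cells contain U+00A0 (a non-breaking space), which the default ASCII codec cannot encode. This is a minimal sketch (Python 3 syntax shown; the same applies to Python 2 unicode strings), using a hypothetical cell value:

```python
# Hypothetical cell text containing a non-breaking space (U+00A0).
nbsp_text = u'2014-03-01\xa0E3'

# Encoding to ASCII fails, because U+00A0 is outside range(128).
try:
    nbsp_text.encode('ascii')
except UnicodeEncodeError as e:
    print(e)

# Two ways out: encode to UTF-8 before writing, or replace the
# offending character first so the string is plain ASCII.
print(nbsp_text.encode('utf-8'))
print(nbsp_text.replace(u'\xa0', ','))
```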
I've tried using .encode('utf-8') in the write call, without success.
This is what I'm running. I used the same code structure to parse another table and it worked. If anyone can point me in the right direction, I'd greatly appreciate it.
from threading import Thread
import urllib2
import re
from bs4 import BeautifulSoup

url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
myfile = open('base/basei/' + url[57:].replace("%20", " ").replace("%27", "'") + '.txt', 'w+')
soup = BeautifulSoup(urllib2.urlopen(url).read())
for tr in soup.find_all('tr')[0:]:
    tds = tr.find_all('td')
    if len(tds) >= 0:
        print tds[0].text, ",", tds[4].text, ",", tds[7].text, ",", tds[12].text, ",", tds[14].text, ",", tds[17].text
        myfile.write(tds[0].text + ',' + tds[4].text + "," + tds[7].text + "," + tds[12].text + "," + tds[14].text + "," + tds[17].text)
myfile.close()
Answer 0 (score: 1)
The following code works for me. I replaced the non-breaking spaces with commas; that way you can use the output directly as CSV (e.g., it reads easily into Excel or LibreOffice Calc).
import urllib2
from bs4 import BeautifulSoup

url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
soup = BeautifulSoup(urllib2.urlopen(url).read())

with open('out.txt', 'w') as myfile:
    for tr in soup.find_all('tr')[0:]:
        tds = tr.find_all('td')
        if len(tds) >= 0:
            stripped_tds = [tds[x].text.strip() for x in (0, 4, 7, 12, 14, 17)]
            out = ','.join(stripped_tds)
            out = out.replace(u'\xa0', ',')
            print out
            myfile.write(out + '\n')
(With a with statement there is no need to call myfile.close() explicitly: the file is closed implicitly when the code inside the with block finishes, even if an exception is raised.)
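That guarantee is easy to demonstrate. In this sketch (Python 3 syntax, with a throwaway temp file) an exception is raised inside the with block, yet the file still ends up closed and its contents flushed to disk:

```python
import os
import tempfile

# Write to a temp file so the demo is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

try:
    with open(path, 'w') as f:
        f.write('first line\n')
        raise RuntimeError('simulated failure')  # abort mid-block
except RuntimeError:
    pass

# The with statement closed the file despite the exception,
# so the write was flushed and the handle is not leaked.
print(f.closed)                # True
print(open(path).read())       # first line
```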
Contents of out.txt:
2014-04-15,E5,31.28,7,6,32.18,C
2014-04-13,E6,31.07,2,4,31.64,B
2014-04-11,E6,31.21,6,6,32.53,B
2014-04-07,E7,30.93,5,7,32.31,B
2014-04-03,S1,30.82,3,2,31.23,
2014-03-30,E9,31.02,3,8,31.97,A
2014-03-28,E9,30.95,7,8,31.85,A
2014-03-23,E9,30.88,8,8,32.06,A
2014-03-21,E6,30.83,1,1,30.83,SB
2014-03-17,E5,31.14,1,1,31.14,C
2014-03-15,E5,31.00,4,4,31.62,C
2014-03-10,E3,31.46,4,1,31.46,D
2014-03-08,A3,31.79,4,5,32.23,D
2014-03-03,A6,31.20,3,5,31.81,D
2014-03-01,E3,31.61,3,3,31.88,D