Parsing an HTML table and writing a text file in Python

Date: 2014-04-18 16:44:00

Tags: python web-scraping beautifulsoup html-table fwrite

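I am parsing an HTML table in Python using BS4. Everything works: I can identify all the elements I need and print them. But the program stops working when I try to write the results to a text file. I get this error: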

  

"UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 13: ordinal not in range(128)"

I tried using .encode('utf-8') in the write call, but then what gets written looks like this: 31.61†
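For context, here is a minimal sketch of what is probably going on. It assumes a hypothetical cell value, `u"31.61\xa0"`, that ends in a non-breaking space (U+00A0); the real page content may differ. The ASCII codec cannot represent that character, while UTF-8 stores it as the two bytes `\xc2\xa0`. If those bytes are later read with a single-byte encoding such as Latin-1 (an assumption about the viewer, not something confirmed here), an extra stray character shows up after the number. The snippet is written to run under both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical cell text: a number followed by a non-breaking space (U+00A0).
cell = u"31.61\xa0"

# Encoding with the ASCII codec fails, because U+00A0 is outside range(128).
try:
    cell.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# UTF-8 can represent the character; it becomes the two bytes C2 A0.
utf8_bytes = cell.encode("utf-8")

# Reading those bytes back as Latin-1 yields one character per byte,
# so the result is one character longer than the original text.
misread = utf8_bytes.decode("latin-1")

print(repr(utf8_bytes))
print(len(cell), len(misread))
```

If this is the cause, the UTF-8 file itself is fine and only needs to be opened as UTF-8, or the non-breaking spaces can be removed before writing.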

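This is what I am running. I used the same code structure to parse another table, and it worked. If anyone can point me in the right direction, I would appreciate it.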

from threading import Thread
import urllib2
import re
from bs4 import BeautifulSoup


url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan" 
myfile = open('base/basei/' + url[57:].replace("%20", " ").replace("%27","'") + '.txt','w+')
soup = BeautifulSoup(urllib2.urlopen(url).read())  
for tr in soup.find_all('tr')[0:]:
  tds = tr.find_all('td')
  if len(tds) >=0:
    print tds[0].text, ",", tds[4].text, ",", tds[7].text, ",", tds[12].text, ",", tds[14].text, ",", tds[17].text
    myfile.write(tds[0].text + ','+ tds[4].text + "," + tds[7].text + "," + tds[12].text + "," + tds[14].text + "," + tds[17].text)

myfile.close() 

1 answer:

Answer 0 (score: 1)

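The following code works for me. I replaced the non-breaking spaces with commas; that way you can use the output directly as CSV (which you can easily read into Excel or LibreOffice Calc, for example).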

import urllib2
from bs4 import BeautifulSoup

url = "http://trackinfo.com/dog-racelines.jsp?page=1&runnername=Ww%20Gloriaestefan"
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Indices of the table cells to keep from each row.
columns = (0, 4, 7, 12, 14, 17)

with open('out.txt', 'w') as myfile:
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        # Skip rows that do not contain every cell indexed above.
        if len(tds) > max(columns):
            stripped_tds = [tds[x].text.strip() for x in columns]
            out = ','.join(stripped_tds)
            # Replace non-breaking spaces (U+00A0) with commas.
            out = out.replace(u'\xa0', ',')
            print out
            myfile.write(out + '\n')

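(The with statement makes an explicit call to myfile.close() unnecessary. When the block inside with finishes, even if an exception is raised, the file is closed implicitly.)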
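The closing behaviour of with can be checked with a small sketch like the one below. The temporary file name and the simulated ValueError are made up for illustration; they are not part of the scraping code.

```python
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

try:
    with open(path, "w") as myfile:
        myfile.write("first line\n")
        # Simulate a failure in the middle of the with block.
        raise ValueError("simulated failure while writing")
except ValueError:
    pass

# The file object was closed on the way out of the with block,
# and the data written before the error was flushed to disk.
closed_after_error = myfile.closed
with open(path) as f:
    written = f.read()

print(closed_after_error)
print(repr(written))
```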

Contents of out.txt:

2014-04-15,E5,31.28,7,6,32.18,C
2014-04-13,E6,31.07,2,4,31.64,B
2014-04-11,E6,31.21,6,6,32.53,B
2014-04-07,E7,30.93,5,7,32.31,B
2014-04-03,S1,30.82,3,2,31.23,
2014-03-30,E9,31.02,3,8,31.97,A
2014-03-28,E9,30.95,7,8,31.85,A
2014-03-23,E9,30.88,8,8,32.06,A
2014-03-21,E6,30.83,1,1,30.83,SB
2014-03-17,E5,31.14,1,1,31.14,C
2014-03-15,E5,31.00,4,4,31.62,C
2014-03-10,E3,31.46,4,1,31.46,D
2014-03-08,A3,31.79,4,5,32.23,D
2014-03-03,A6,31.20,3,5,31.81,D
2014-03-01,E3,31.61,3,3,31.88,D
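
As a rough sketch of using this output as CSV, the lines can be parsed with Python's standard csv module. The two sample lines below are copied from the listing above; in practice you would pass the open out.txt file object to csv.reader instead of a list. Note that a trailing comma, as in the 2014-04-03 row, produces an empty last field.

```python
import csv

# Two lines copied from the out.txt listing above.
sample_lines = [
    "2014-04-15,E5,31.28,7,6,32.18,C",
    "2014-04-03,S1,30.82,3,2,31.23,",
]

# csv.reader accepts any iterable that yields one line of text at a time.
rows = list(csv.reader(sample_lines))

for row in rows:
    # Every row has seven fields; the last one may be an empty string.
    print(len(row), row)
```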