Question

I'm trying to parse the table in this documentation and write it to a CSV file.

I used this code:

import csv
import urllib2
from bs4 import BeautifulSoup

with open('litigation.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in range(39):
        url = "http://www.sec.gov/divisions/enforce/friactions.shtml".format(i)
        u = urllib2.urlopen(url)
        try:
            html = u.read()
        finally:
            u.close()
        soup = BeautifulSoup(html)
        table = soup.find("table", {"cellspacing": "7"})
        for tr in table.find_all('tr')[2:]:
            tds = tr.find_all('td')
            row = [elem.text.encode('utf-8') for elem in tds]
            writer.writerow(row)

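But when I run it, the result is strange.

I intended to scrape all the tables, but only the Sep 4 content ends up in the CSV file. In addition, the content is repeated several times (about 3 times, I would guess), which makes the rows longer than I intended.

Can anyone help me solve this?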

Answer 1

You are running into differences between parsers:

In [8]: rows = BeautifulSoup(html, "html.parser").find("table", {"cellspacing" : "7"}).find_all('tr')[2:]

In [9]: len(rows)
Out[9]: 29

In [10]: rows = BeautifulSoup(html, "html5lib").find("table", {"cellspacing" : "7"}).find_all('tr')[2:]

In [11]: len(rows)
Out[11]: 97

In [12]: rows = BeautifulSoup(html, "lxml").find("table", {"cellspacing" : "7"}).find_all('tr')[2:]

In [13]: len(rows)
Out[13]: 97

In other words, html.parser does not handle this particular HTML well; use html5lib or lxml instead.

See also:
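Applied to the question's code, the fix could look roughly like the sketch below. It assumes Python 3 and that lxml (or html5lib) is installed; `table_rows` and `rows_to_csv_text` are hypothetical helper names, not part of BeautifulSoup. Note also that the URL in the question has no `{}` placeholder, so `.format(i)` returns the same string on every pass and the loop downloads one page 39 times; parsing a single download should be enough.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_rows(html, parser="lxml"):
    """Return the cell text of each data row in the cellspacing="7" table.

    The parser is named explicitly so the result does not depend on
    whichever parser BeautifulSoup would otherwise pick by default.
    """
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", {"cellspacing": "7"})
    if table is None:
        return []
    rows = []
    # Skip the first two rows, as the question's code does.
    for tr in table.find_all("tr")[2:]:
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


def rows_to_csv_text(rows):
    """Serialize the rows to CSV text (hypothetical helper for illustration)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

To write a file instead, open it with `open("litigation.csv", "w", newline="", encoding="utf-8")` and pass that file object to `csv.writer`; in Python 3 the `.encode('utf-8')` call on each cell is then no longer needed.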

Installing a parser
