Parsing columns with BeautifulSoup and saving as JSON

Date: 2016-04-16 12:08:55

Tags: python html json beautifulsoup bs4

I want to parse the Afk., Aantal, and Zetels columns from this site: http://www.nlverkiezingen.com/TK2012.html, so that I can eventually save them to a JSON file.

Before saving it as a JSON file, I need to parse those elements.

I have:

from bs4 import BeautifulSoup
import urllib

jaren = [str("2010"), str("2012")]

for Jaargetal in jaren:
    r = urllib.urlopen("http://www.nlverkiezingen.com/TK" + Jaargetal +".html").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        header = soup.find_all("h1")[0].getText()
        print header

        trs = table.find_all("tr")[0].getText()
        print '\n'
        for tr in table.find_all("tr"):
            print "|".join([x.get_text().replace('\n','') for x in tr.find_all('td')])

I have tried:

from bs4 import BeautifulSoup
import urllib

jaren = [str("2010"), str("2012")]

for Jaargetal in jaren:
    r = urllib.urlopen("http://www.nlverkiezingen.com/TK" + Jaargetal +".html").read()
    soup = BeautifulSoup(r, "html.parser")
    tables = soup.find_all("table")

    for table in tables:
        header = soup.find_all("h1")[0].getText()
        print header

        for tr in table.find_all("tr"):
            firstTd = tr.find("td")
            if firstTd and firstTd.has_attr("class") and "l" in firstTd['class']:
                tds = tr.find_all("td")

                for tr in table.find_all("tr"): 
                    print "|".join([x.get_text().replace('\n','') for x in tr.find_all('td')])
                    break

What am I doing wrong, or what should I do? Am I on the right track?

1 answer:

Answer 0 (score: 0):

One option for extracting only the desired columns is to check each column's index. Define the column indexes you are interested in:

DESIRED_COLUMNS = {1, 2, 5}  # it is a set

Then use enumerate() together with find_all():
"|".join([x.get_text().replace('\n', '') 
          for index, x in enumerate(tr.find_all('td')) 
          if index in DESIRED_COLUMNS])