Python BeautifulSoup HTML table scraping

Time: 2016-11-30 12:31:40

Tags: python-3.x beautifulsoup

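I'm trying to learn how to scrape a page with BeautifulSoup and write the results to a csv file. When I start appending the column values to the keys of a dictionary, every value gets appended to every key instead of to just one.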

I'm getting the information I want:

[<td class="column-2">655</td>]
[<td class="column-2">660</td>]
[<td class="column-2">54</td>]
[<td class="column-2">241</td>] 

But when I then try to assign each value to a key, I get:

{'date': ['14th November 2016'], 'total complaints': ['655', '660', '54', '241'], 'complaints': ['655', '660', '54', '241'], 'departures': ['655', '660', '54', '241'], 'arrivals': ['655', '660', '54', '241']}

Full code (the csv writer is only there for testing):

import requests
from bs4 import BeautifulSoup as BS
import csv

operational_data_url = "http://heathrowoperationaldata.com/daily-operational-data/"
operational_data_page = requests.get(operational_data_url).text
print(operational_data_page)

soup = BS(operational_data_page, "html.parser")

data_div = soup.find_all("ul", class_="sub-menu")

list_items = data_div[0].find_all("li")

data_links = []
for menu in data_div:
    list_items = menu.find_all("li")
    for links in list_items:
        data_link = links.find("a")
        data_links.append(data_link.get("href"))

for page in data_links[:1]:
    data_page = requests.get(page).text

soup = BS(data_page, "html.parser")
date = soup.find("title")
table = soup.find("tbody")

data = {
    "date" : [],
    "arrivals" : [],
    "departures" : [],
    "complaints" : [],
    "total complaints" : [],    
}

for day in date:
    data["date"].append(day)

rows = table.find_all("tr", class_=["row-3", "row-4", "row-36", "row-37"])
for row in rows:
    cols = row.find_all("td", class_="column-2")
    data["arrivals"].append( cols[0].get_text() )
    data["departures"].append( cols[0].get_text() )
    data["complaints"].append( cols[0].get_text() )
    data["total complaints"].append( cols[0].get_text() )

#test
with open('test.csv', 'w') as test_file:

    fields = ['date', 'arrivals', 'departures', 'complaints', 'total complaints']

    writer = csv.DictWriter(test_file, fields)
    writer.writeheader()

    row = {'date': day, 'arrivals': 655, 'departures': 660, 'complaints': 54, 'total complaints': 241 }
    writer.writerow(row) 

Thanks for your help!

1 answer:

Answer 0: (score: 1)

  

When I start appending the column values to the keys of a dictionary, every value gets appended to every key instead of to just one.

Right now, your for row in rows: loop does exactly that: on every pass it appends the same cols[0] value to all four keys.
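The effect can be reproduced without any scraping. The sketch below is only an illustration, assuming the four cell values arrive in the order shown in the question; the names values, keys, buggy and fixed are made up for this example and do not appear in the original code.

```python
# Cell values in the order the four rows were printed in the question.
values = ["655", "660", "54", "241"]
keys = ["arrivals", "departures", "complaints", "total complaints"]

# What the original loop effectively does: for each row, the row's value
# is appended to every key.
buggy = {key: [] for key in keys}
for value in values:
    for key in keys:
        buggy[key].append(value)

# Pairing each key with exactly one value instead.
fixed = {key: [] for key in keys}
for key, value in zip(keys, values):
    fixed[key].append(value)

print(buggy["arrivals"])  # ['655', '660', '54', '241']
print(fixed["arrivals"])  # ['655']
```

Every list in buggy ends up with all four values, which matches the dictionary shown in the question.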

It looks to me like you want to do something like this:

rows = table.find_all("tr", class_=["row-3", "row-4", "row-36", "row-37"])
cols = [row.find_all("td", class_="column-2")[0] for row in rows]
data["arrivals"].append(cols[0].get_text())
data["departures"].append(cols[1].get_text())
data["complaints"].append(cols[2].get_text())
data["total complaints"].append(cols[3].get_text())

This gives the following result for data:
{'date': [u'14th November 2016'], 'complaints': [u'54'], 'total complaints': [u'241'], 'departures': [u'660'], 'arrivals': [u'655']}
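Since the stated goal is a csv file, a dictionary of this shape could be written out roughly as follows. This is a sketch, not part of the original answer: the data values are copied from the result above, and io.StringIO stands in for the test.csv file so that nothing is written to disk.

```python
import csv
import io

# Same shape as the dictionary built by the scraping code.
data = {
    "date": ["14th November 2016"],
    "arrivals": ["655"],
    "departures": ["660"],
    "complaints": ["54"],
    "total complaints": ["241"],
}
fields = ["date", "arrivals", "departures", "complaints", "total complaints"]

# With a real file this would be: open("test.csv", "w", newline="")
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()

# zip(*columns) turns the per-key lists into one tuple per scraped page.
for values in zip(*(data[field] for field in fields)):
    writer.writerow(dict(zip(fields, values)))

csv_text = buffer.getvalue()
print(csv_text)
```

Because each key holds a list, scraping more pages simply adds more entries per list, and the zip loop then emits one csv row per page.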

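Note that indexing cols by position like this only works if rows comes back in the expected order (arrivals, departures, complaints, total complaints).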
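One way to avoid depending on row order is to look up the key from the row's class. The sketch below assumes that row-3, row-4, row-36 and row-37 correspond to arrivals, departures, complaints and total complaints respectively, which is inferred from the output in the question rather than confirmed. The HTML snippet, its column-1 labels and the ROW_TO_KEY name are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the cells shown in the question,
# deliberately listed out of order.
html = """
<table><tbody>
<tr class="row-36"><td class="column-1">Complaints</td><td class="column-2">54</td></tr>
<tr class="row-3"><td class="column-1">Arrivals</td><td class="column-2">655</td></tr>
<tr class="row-4"><td class="column-1">Departures</td><td class="column-2">660</td></tr>
<tr class="row-37"><td class="column-1">Total</td><td class="column-2">241</td></tr>
</tbody></table>
"""

# Assumed mapping from row class to dictionary key.
ROW_TO_KEY = {
    "row-3": "arrivals",
    "row-4": "departures",
    "row-36": "complaints",
    "row-37": "total complaints",
}

soup = BeautifulSoup(html, "html.parser")
data = {key: [] for key in ROW_TO_KEY.values()}

for row in soup.find_all("tr"):
    # "class" is a multi-valued attribute, so bs4 returns it as a list.
    for css_class in row.get("class", []):
        key = ROW_TO_KEY.get(css_class)
        if key is None:
            continue
        cell = row.find("td", class_="column-2")
        if cell is not None:
            data[key].append(cell.get_text(strip=True))

print(data)
```

Each value lands under the key that its row class maps to, regardless of where the row sits in the table.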