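I'm learning how to scrape a page with BeautifulSoup and write the results to a CSV file. When I start appending the column values to the keys of a dictionary, every value gets appended to every key rather than to just one.
I'm getting the information I want: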
[<td class="column-2">655</td>]
[<td class="column-2">660</td>]
[<td class="column-2">54</td>]
[<td class="column-2">241</td>]
But then, when I try to assign each value to a key, I get:
{'date': ['14th November 2016'], 'total complaints': ['655', '660', '54', '241'], 'complaints': ['655', '660', '54', '241'], 'departures': ['655', '660', '54', '241'], 'arrivals': ['655', '660', '54', '241']}
Full code (the csv writer is only there for testing):
import requests
from bs4 import BeautifulSoup as BS
import csv
operational_data_url = "http://heathrowoperationaldata.com/daily-operational-data/"
operational_data_page = requests.get(operational_data_url).text
print(operational_data_page)
soup = BS(operational_data_page, "html.parser")
data_div = soup.find_all("ul", class_="sub-menu")
list_items = data_div[0].find_all("li")
data_links = []
for menu in data_div:
    list_items = menu.find_all("li")
    for links in list_items:
        data_link = links.find("a")
        data_links.append(data_link.get("href"))
for page in data_links[:1]:
    data_page = requests.get(page).text
    soup = BS(data_page, "html.parser")
    date = soup.find("title")
    table = soup.find("tbody")
    data = {
        "date" : [],
        "arrivals" : [],
        "departures" : [],
        "complaints" : [],
        "total complaints" : [],
    }
    for day in date:
        data["date"].append(day)
    rows = table.find_all("tr", class_=["row-3", "row-4", "row-36", "row-37"])
    for row in rows:
        cols = row.find_all("td", class_="column-2")
        data["arrivals"].append( cols[0].get_text() )
        data["departures"].append( cols[0].get_text() )
        data["complaints"].append( cols[0].get_text() )
        data["total complaints"].append( cols[0].get_text() )
#test
with open('test.csv', 'w') as test_file:
    fields = ['date', 'arrivals', 'departures', 'complaints', 'total complaints']
    writer = csv.DictWriter(test_file, fields)
    writer.writeheader()
    row = {'date': day, 'arrivals': 655, 'departures': 660, 'complaints': 54, 'total complaints': 241 }
    writer.writerow(row)
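Thanks for your help!
Answer 0 (score: 1)
When I start appending the column values to the keys of a dictionary, every value gets appended to every key rather than to just one.
At the moment, your for row in rows: loop does exactly that, explicitly: on every iteration it appends cols[0] from the current row to all four lists.
It looks to me like you want something like this: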
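To make the behaviour concrete, here is a minimal sketch that reproduces it without any scraping. The values are the four numbers from the question's output; the names row_values, keys, buggy and paired are made up for this illustration and do not appear in the original code.

```python
# The four cell values, in the order the rows are visited
# (copied from the output shown in the question).
row_values = ["655", "660", "54", "241"]
keys = ["arrivals", "departures", "complaints", "total complaints"]

# What the original loop effectively does: for every row,
# the same value is appended to all four lists.
buggy = {key: [] for key in keys}
for value in row_values:
    for key in keys:
        buggy[key].append(value)

# Pairing each key with exactly one row instead.
paired = {key: [] for key in keys}
for key, value in zip(keys, row_values):
    paired[key].append(value)

print(buggy["arrivals"])
print(paired["arrivals"])
```

The first dictionary ends up with all four values under every key, which matches the output in the question; the second holds one value per key.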
rows = table.find_all("tr", class_=["row-3", "row-4", "row-36", "row-37"])
cols = [row.find_all("td", class_="column-2")[0] for row in rows]
data["arrivals"].append(cols[0].get_text())
data["departures"].append(cols[1].get_text())
data["complaints"].append(cols[2].get_text())
data["total complaints"].append(cols[3].get_text())
This gives the following for data:
{'date': [u'14th November 2016'], 'complaints': [u'54'], 'total complaints': [u'241'], 'departures': [u'660'], 'arrivals': [u'655']}
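Note that this only works if the rows in rows are in the right order (arrivals, departures, complaints, total complaints).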
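If relying on the order feels fragile, one possible alternative is to look each row up by its class and map it to a key explicitly. The sketch below is only an illustration: the HTML snippet is a made-up stand-in for the real page (deliberately out of order), and ROW_TO_KEY is an assumed mapping based on the class names used in the question, not something confirmed from the site.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page; only the class names
# (row-3, row-4, row-36, row-37, column-2) come from the question.
html = """
<table><tbody>
<tr class="row-36"><td class="column-1">Complaints</td><td class="column-2">54</td></tr>
<tr class="row-3"><td class="column-1">Arrivals</td><td class="column-2">655</td></tr>
<tr class="row-4"><td class="column-1">Departures</td><td class="column-2">660</td></tr>
<tr class="row-37"><td class="column-1">Total complaints</td><td class="column-2">241</td></tr>
</tbody></table>
"""

# Assumed mapping from row class to dictionary key.
ROW_TO_KEY = {
    "row-3": "arrivals",
    "row-4": "departures",
    "row-36": "complaints",
    "row-37": "total complaints",
}

soup = BeautifulSoup(html, "html.parser")
table = soup.find("tbody")
data = {key: [] for key in ROW_TO_KEY.values()}

for row_class, key in ROW_TO_KEY.items():
    # class_ matches whole class tokens, so "row-3" does not match "row-36".
    row = table.find("tr", class_=row_class)
    if row is None:
        continue  # this row is missing on the page; leave the list empty
    cell = row.find("td", class_="column-2")
    if cell is not None:
        data[key].append(cell.get_text(strip=True))

print(data)
```

Because each row is fetched by its own class, the result no longer depends on the order in which the rows appear in the table.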