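I am scraping some data with BeautifulSoup and want to write it to a JSON file. I have managed to write a script that saves the data to a JSON file, but it only saves the last item on the page instead of all of the results, even though every result is printed in the terminal. I'm not sure what I'm missing. Here is my code: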
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json
otl_url = 'https://open.umn.edu/opentextbooks/SearchResults.aspx?subjectAreaId=99'
#opening up connection and grabbing page
uClient = urlopen(otl_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs info for each textbook
containers = page_soup.findAll("div",{"class":"twothird"})
data = {}
for container in containers:
    data['title'] = container.h2.text
    data['author'] = container.p.text
    data['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]
    print("title: " + data['title'])
    print("author: " + data['author'])
    print("link: " + data['link'])
with open("textbooks.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
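Answer 0 (score: 0)
You are storing the data in a dict, and a dict can hold only one value per key, so each iteration overwrites the previous one. If you want to store several items, use a list of dicts, for example: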
data = []
for container in containers:
    data.append({"title": container.h2.text, "author": container.p.text,
                 "link": "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]})
with open("textbooks.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
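To make the difference concrete, here is a minimal sketch that runs without any network access. The sample titles, authors, and href values are made up for illustration and stand in for the scraped containers; only the two loop shapes come from the thread.

```python
import json

# Hypothetical sample rows standing in for the scraped containers.
sample_books = [
    ("Algebra Basics", "A. Author", "BookDetail.aspx?bookId=1"),
    ("Calculus Notes", "B. Writer", "BookDetail.aspx?bookId=2"),
]
base_url = "https://open.umn.edu/opentextbooks/"

# One shared dict: each pass overwrites the same three keys.
overwritten = {}
for title, author, href in sample_books:
    overwritten["title"] = title
    overwritten["author"] = author
    overwritten["link"] = base_url + href

# A list of dicts: each pass appends a new record.
records = []
for title, author, href in sample_books:
    records.append({"title": title, "author": author, "link": base_url + href})

print(json.dumps(overwritten, ensure_ascii=False))
print(json.dumps(records, ensure_ascii=False, indent=2))
```

The first print shows only the last book, which matches what ended up in the question's JSON file; the second shows one object per book.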
Answer 1 (score: 0)
In the for loop, these lines:
data['title'] = container.h2.text
data['author'] = container.p.text
data['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]
reset the dictionary's values on every iteration of the loop. What I suggest is making them lists, like this, before the loop:
data['title'] = []
data['author'] = []
data['link'] = []
Then, inside your for loop:
data["title"].append(container.h2.text)
data["author"].append(container.p.text)
data["link"].append("https://open.umn.edu/opentextbooks/" + container.h2.a["href"])
This keeps the values from every container found, and you should see all of them in the JSON file.
Hope this helps!
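Note that a dict of lists produces column-oriented JSON: one list of titles, one of authors, one of links. If one object per book is needed later, the columns can be zipped back together. The sketch below assumes that shape; the sample values and example.com links are hypothetical.

```python
import json

# Hypothetical column-oriented data, shaped like the dict of lists above.
data = {
    "title": ["Algebra Basics", "Calculus Notes"],
    "author": ["A. Author", "B. Writer"],
    "link": ["https://example.com/1", "https://example.com/2"],
}

# zip() walks the three lists in parallel; item i of each list belongs to book i.
books = [
    {"title": title, "author": author, "link": link}
    for title, author, link in zip(data["title"], data["author"], data["link"])
]

print(json.dumps(books, ensure_ascii=False, indent=2))
```

This only works if the three lists stay the same length, so a container with a missing author would need to append a placeholder rather than skip the append.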
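Answer 2 (score: 0)
This is because you are reassigning the contents of the data object on every iteration of the loop, so only the last item survives. You probably want something more like this: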
data = []  # create a list to store the items
for container in containers:
    item = {}
    item['title'] = container.h2.text
    item['author'] = container.p.text
    item['link'] = "https://open.umn.edu/opentextbooks/" + container.h2.a["href"]
    data.append(item)  # add the item to the list
    print("title: " + item['title'])
    print("author: " + item['author'])
    print("link: " + item['link'])
with open("textbooks.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)  # dump the list, not a single item
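A quick way to confirm that every item reached the file is to read it back and count the records. The sketch below is one possible check, not part of the original answers: the records are hypothetical, and it writes to a temporary directory. Because ensure_ascii=False writes non-ASCII characters as-is and the default text encoding of open() depends on the platform, it also passes encoding="utf-8" explicitly.

```python
import json
import os
import tempfile

# Hypothetical records; the second title contains a non-ASCII character.
data = [
    {"title": "Algebra Basics", "author": "A. Author"},
    {"title": "Analyse élémentaire", "author": "B. Writer"},
]

# Write to a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "textbooks.json")

# ensure_ascii=False keeps non-ASCII characters unescaped, so name the encoding.
with open(path, "w", encoding="utf-8") as write_json:
    json.dump(data, write_json, ensure_ascii=False, indent=2)

# Read the file back and confirm every record survived the round trip.
with open(path, encoding="utf-8") as read_json:
    loaded = json.load(read_json)

print(len(loaded), "records written")
```

If the count printed here is 1 while the terminal showed many results, the write is still receiving a single dict rather than the list.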