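I have a script that loops through multiple web pages, but I am stuck on one small problem. I am trying to add the author to each item in the list, but my script takes the last author on the page and applies it to every author field. How can I get my script to apply each author to the relevant title? Here is my code: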
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
import json

base_url = "https://archive.org/details/librivoxaudio?&sort=titleSorter"

data = []
n = 5
for i in range(1, n+1):
    response = urlopen(base_url + "&page=" + str(i))
    page_html = response.read()
    response.close()

    #html parsing
    page_soup = soup(page_html, "html.parser")

    #grabs info for each book
    containers = page_soup.findAll("div",{"class":"item-ttl"})
    authors = page_soup.findAll("span",{"class":"byv"})

    for container in containers:
        item = {}
        item['type'] = "Public Domain Audiobook"
        item['title'] = container.text.lstrip().strip()
        for author in authors:
            item['author'] = author.text
        item['link'] = "https://archive.org/" + container.a["href"]
        item['source'] = "LibriVox"
        item['base_url'] = "https://librivox.org/"
        data.append(item) # add the item to the list

with open("./json/librivoxTest.json", "w") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
Here is a sample of the JSON output:
{
    "type": "Public Domain Audiobook",
    "title": "A Book of Old English Ballads",
    "author": "Charles Whibley",
    "link": "https://archive.org//details/book_old_english_ballads_1007_librivox",
    "source": "LibriVox",
    "base_url": "https://librivox.org/"
}, {
    "type": "Public Domain Audiobook",
    "title": "A Book of Scoundrels",
    "author": "Charles Whibley",
    "link": "https://archive.org//details/scoundrels_1712_librivox",
    "source": "LibriVox",
    "base_url": "https://librivox.org/"
}
The last author is correct for "A Book of Scoundrels", but "A Book of Old English Ballads" should have George Wharton Edwards as its author.
Answer 0: (score: 1)
I think the following script will solve the problem you are having. I tried to write it in an organized way.
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import json

urls = ["https://archive.org/details/librivoxaudio?&sort=titleSorter&page={}".format(page) for page in range(1,3)]

for link in urls:
    soup = BeautifulSoup(urlopen(link).read(), "html.parser")
    data = []
    for container in soup.select("div[data-id$='_librivox']"):
        item = {}
        item['type'] = "Public Domain Audiobook"
        item['title'] = container.select_one(".ttl").get_text(strip=True)
        item['author'] = container.select_one(".byv").get_text(strip=True) if container.select_one(".byv") else ""
        item['link'] = urljoin(link, container.select_one("a[title]")['href']) if container.select_one("a[title]") else ""
        item['source'] = "LibriVox"
        item['base_url'] = "https://librivox.org/"
        data.append(item)

    print(json.dumps(data,indent=4))
The output looks like this:
[
    {
        "type": "Public Domain Audiobook",
        "title": "\"BOOH!\"",
        "author": "Eugene Field",
        "link": "https://archive.org/details/booh_1403.poem_librivox",
        "source": "LibriVox",
        "base_url": "https://librivox.org/"
    },
    {
        "type": "Public Domain Audiobook",
        "title": "\"You Bid Me Try\"",
        "author": "Henry Austin Dobson",
        "link": "https://archive.org/details/youbid_metry_1104_librivox",
        "source": "LibriVox",
        "base_url": "https://librivox.org/"
    },
Answer 1: (score: 0)
for author in authors:
    item['author'] = author.text
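This loop runs over all the authors and assigns each one in turn to the item's author field. Every assignment overwrites the previous one, so the item always ends up with the last author on the page.
To set the matching author instead, either create an iterator over the authors (authors_iterator = iter(authors)) and assign next(authors_iterator) to the item, or loop over the containers with enumerate and use the index to look up the author in authors.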
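The two options above can be sketched without touching the network. This is only a minimal illustration: the `titles` and `authors` lists are hypothetical stand-ins for the text of the scraped tags, and the variable names `items_with_iterator` and `items_with_index` are not from the original script.

```python
# Hypothetical stand-ins for the text of the scraped title and author tags.
titles = ["A Book of Old English Ballads", "A Book of Scoundrels"]
authors = ["George Wharton Edwards", "Charles Whibley"]

# Option 1: advance an iterator over the authors once per title.
authors_iterator = iter(authors)
items_with_iterator = []
for title in titles:
    item = {"title": title, "author": next(authors_iterator)}
    items_with_iterator.append(item)

# Option 2: loop with enumerate and use the index to pick the author.
items_with_index = []
for index, title in enumerate(titles):
    item = {"title": title, "author": authors[index]}
    items_with_index.append(item)

for item in items_with_iterator:
    print(item["title"], "by", item["author"])
```

Both variants assume the two lists have the same length and the same order. If a book on the page has no author tag, every later pairing shifts by one, and next() or the index lookup eventually fails with StopIteration or IndexError. Looking up the author inside each book's own container, as the other answer does, avoids that assumption.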