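I'm scraping the title, description, link, and person's name from multiple divs that all follow the same structure. I'm using BeautifulSoup, and I can scrape everything from the first div. However, I can't scrape the data from my long list of divs and get it into a portable format such as CSV or JSON.
How do I scrape each item from a long list of divs and store that information in a JSON object for each mp3?
The divs look like this: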
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link1.mp3">Right-click to download</a>] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link2.mp3">Right-click to download</a>] </div>
</div>
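I've figured out how to scrape the first div, but I can't get the information for every div. For example, my code below just prints the h3 of the first div over and over.
I know I could create separate Python lists for the titles, descriptions, and so on, but how do I keep the metadata structured, as in JSON, so that title1, link1, and description1 stay together, and likewise for title2's information?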
from bs4 import BeautifulSoup

with open('soup.html', 'r') as myfile:
    html_doc = myfile.read()

soup = BeautifulSoup(html_doc, 'html.parser')
audio_div = soup.find_all('div', {'class': "audioBoxWrap clearBoth"})
print(len(audio_div))

# create dictionary for storing scraped data. I don't know how to store the values for each mp3 separately.
for i in audio_div:
    # searches the whole document, so this prints the first h3 on every iteration
    print(soup.find('h3').text)
I'd like my JSON to look like this:
{
    "podcasts": [
        {
            "title": "title1",
            "description": "description1",
            "link": "link1"
        },
        {
            "title": "title2",
            "description": "description2",
            "link": "link2"
        }
    ]
}
Answer 0 (score: 3)
Iterate over each track and make context-specific searches:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<div>
<div class="audioBoxWrap clearBoth">
<h3>Title 1</h3>
<p>Description 1</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link1.mp3">Right-click to download</a>] </div>
</div>
<div class="audioBoxWrap clearBoth">
<h3>Title 2</h3>
<p>Description 2</p>
<div class="info" style="line-height: 1px; height: 1px; font-size: 1px;"></div>
<div class="audioBox" style="display: none;">
stuff
</div>
<div> [ <a href="link2.mp3">Right-click to download</a>] </div>
</div>
</div>"""
soup = BeautifulSoup(data, "html.parser")

tracks = soup.find_all('div', {'class': "audioBoxWrap clearBoth"})
result = {
    "podcasts": [
        {
            "title": track.h3.get_text(strip=True),
            "description": track.p.get_text(strip=True),
            "link": track.a["href"]
        }
        for track in tracks
    ]
}

pprint(result)
Prints:
{'podcasts': [{'description': 'Description 1',
               'link': 'link1.mp3',
               'title': 'Title 1'},
              {'description': 'Description 2',
               'link': 'link2.mp3',
               'title': 'Title 2'}]}
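Since the question asks for JSON rather than a printed Python dict, the same per-track search can feed the standard `json` module. The following is a minimal sketch, not part of the original answer: the helper name `scrape_podcasts`, the shortened sample markup, and the choice to skip blocks missing an h3, p, or link are all assumptions made for illustration.

```python
import json

from bs4 import BeautifulSoup

# Shortened sample markup with the same structure as in the question.
html = """
<div class="audioBoxWrap clearBoth">
  <h3>Title 1</h3>
  <p>Description 1</p>
  <div> [ <a href="link1.mp3">Right-click to download</a>] </div>
</div>
<div class="audioBoxWrap clearBoth">
  <h3>Title 2</h3>
  <p>Description 2</p>
  <div> [ <a href="link2.mp3">Right-click to download</a>] </div>
</div>
"""


def scrape_podcasts(html_doc):
    """Hypothetical helper: collect one dict per audioBoxWrap div."""
    soup = BeautifulSoup(html_doc, "html.parser")
    podcasts = []
    for track in soup.find_all("div", class_="audioBoxWrap"):
        # Search inside this track only, not the whole document.
        title = track.find("h3")
        description = track.find("p")
        link = track.find("a", href=True)
        # Assumption: incomplete blocks are skipped rather than raising.
        if title is None or description is None or link is None:
            continue
        podcasts.append({
            "title": title.get_text(strip=True),
            "description": description.get_text(strip=True),
            "link": link["href"],
        })
    return {"podcasts": podcasts}


# Serialize the nested structure; each track's fields stay together.
json_text = json.dumps(scrape_podcasts(html), indent=2)
print(json_text)
```

To save the result instead of printing it, `json.dump(result, f, indent=2)` with an open file handle `f` writes the same structure to disk.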