I have an HTML page and want to extract some items from it, but I am finding it hard to apply BeautifulSoup or lxml.
HTML page:
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
How can I get all three IDs and titles as separate dictionaries in a list, like this?
[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]
Answer 0 (score: 0)
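All the titles and IDs you need are located inside the <div> tags that have the class="episode" attribute. So the job is to iterate over all of those tags and, for each one, read the 'data-id' attribute of the div and the inner text of its span child.
Code: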
from bs4 import BeautifulSoup

html = '''
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
'''
soup = BeautifulSoup(html, 'lxml')

title_list = []
# Each <div class="episode"> carries the ID; its child <span> carries the title.
for ep in soup.find_all('div', class_='episode'):
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)

print(title_list)
Output:
[{'id': 't1', 'title': 'Title to scrape'},
{'id': 't2', 'title': 'Title2 to scrape'},
{'id': 't3', 'title': 'Title3 to scrape'}]
Alternatively, a list comprehension does the same thing:
title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
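
If installing BeautifulSoup or lxml is not an option, the same iterate-and-collect idea can be sketched with only the standard library's html.parser. This is a minimal sketch, not part of the original answer: the class name EpisodeParser and its private attributes are hypothetical, and it assumes each episode div contains no nested div elements.

```python
from html.parser import HTMLParser


class EpisodeParser(HTMLParser):
    """Hypothetical helper: collects {'id', 'title'} dicts from
    <div class="episode"> elements and the <span> inside each one."""

    def __init__(self):
        super().__init__()
        self.episodes = []     # finished results
        self._current = None   # dict for the episode being read, if any
        self._in_span = False  # True while inside the episode's <span>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'div' and 'episode' in classes:
            # Start a new record; the ID lives on the div itself.
            self._current = {'id': attrs.get('data-id'), 'title': ''}
        elif tag == 'span' and self._current is not None:
            self._in_span = True

    def handle_data(self, data):
        # Only text inside the episode's span counts as the title.
        if self._in_span:
            self._current['title'] += data

    def handle_endtag(self, tag):
        if tag == 'span' and self._in_span:
            self._in_span = False
        elif tag == 'div' and self._current is not None:
            # Assumption: no nested divs, so this closes the episode div.
            self._current['title'] = self._current['title'].strip()
            self.episodes.append(self._current)
            self._current = None


html = '''
<li class="context-card">
<div class="episode" data-id="t1">
<span class="av-play">Title to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t2">
<span class="av-play">Title2 to scrape</span>
</div>
</li>
<li class="context-card">
<div class="episode" data-id="t3">
<span class="av-play">Title3 to scrape</span>
</div>
</li>
'''

parser = EpisodeParser()
parser.feed(html)
parser.close()
title_list = parser.episodes
print(title_list)
```

For anything beyond simple, flat markup like this, the BeautifulSoup version is shorter and more forgiving, so this sketch is mainly useful when third-party packages cannot be installed.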