Unable to parse HTML with BeautifulSoup / lxml

Date: 2018-04-03 10:59:04

Tags: python python-2.7 beautifulsoup lxml

I have an HTML page and I want to extract some items from it. I'm finding it hard to apply BeautifulSoup or lxml to it.

HTML page:

<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>

How can I get all 3 of these IDs and titles as separate dictionaries in a list, like this?

[{'id':'t1', 'title': 'Title to scrape'}, {'id':'t2', 'title': 'Title2 to scrape'}, {'id':'t3', 'title': 'Title3 to scrape'}]

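1 answer:

Answer 0 (score: 0)

All the titles and IDs you need are inside the <div> tags that have the class="episode" attribute. So your job is to iterate over all of those tags and, for each one, take the 'data-id' attribute of the div and the text of its inner span.

Code: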

from bs4 import BeautifulSoup

html = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>
'''
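# Parse with the lxml backend (the lxml package must be installed)
soup = BeautifulSoup(html, 'lxml')

title_list = []
# Each episode <div> holds the ID in data-id and the title in its first <span>
for ep in soup.find_all('div', class_='episode'):
    curr_dict = {'id': ep['data-id'], 'title': ep.span.text}
    title_list.append(curr_dict)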

print(title_list)

Output:

[{'id': 't1', 'title': 'Title to scrape'},
 {'id': 't2', 'title': 'Title2 to scrape'},
 {'id': 't3', 'title': 'Title3 to scrape'}]

Alternatively, you can do the same thing with a list comprehension:

title_list = [{'id': ep['data-id'], 'title': ep.span.text} for ep in soup.find_all('div', class_='episode')]
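
Since the question also mentions lxml, here is a rough sketch of the same extraction using lxml on its own, without BeautifulSoup. It is not part of the original answer. It assumes the markup looks exactly like the sample above, in particular that class is exactly "episode" with no other class names and that each div has a <span> as a direct child.

```python
import lxml.html

html = '''
<li class="context-card">
    <div class="episode" data-id="t1">
        <span class="av-play">Title to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t2">
        <span class="av-play">Title2 to scrape</span>
    </div>
</li>
<li class="context-card">
    <div class="episode" data-id="t3">
        <span class="av-play">Title3 to scrape</span>
    </div>
</li>
'''

# document_fromstring always returns the <html> root, even for a fragment
doc = lxml.html.document_fromstring(html)

# The XPath matches divs whose class attribute is exactly "episode";
# findtext('span') returns the text of the first direct <span> child
title_list = [
    {'id': ep.get('data-id'), 'title': ep.findtext('span')}
    for ep in doc.xpath('//div[@class="episode"]')
]

print(title_list)
```

If a div could carry several classes, the exact-match XPath would miss it, so the BeautifulSoup version with class_='episode' is probably the more forgiving choice in that case.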