I need help with web scraping. Here is the HTML sample:
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
<li>This Data 2</li>
<li>This Data 3</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
<li>Their Data 2</li>
<li>Their Data 3</li>
</ul>
</div>
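Using BeautifulSoup, I can get the HTML div tag with name="content-name". But how do I get the text of all the li tags inside the ul tag that follows the h2 tag containing the text "Second Topic"? They all sit inside the same div tag, with no specific class, id, or name. Thanks in advance.
Answer 0: (score: 1)
It is always harder when the tags have no ID or class of their own and no distinguishing parent tag.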
The lookup should return:
That Data 1
That Data 2
That Data 3
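One way to obtain that output is sketched below. It is not the original answerer's code: the helper name `items_under_heading`, the shortened markup, and the choice of the built-in `html.parser` are assumptions made for illustration. The idea is to find the h2 by its text and then ask BeautifulSoup for the next ul sibling.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: same structure).
html = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
</ul>
</div>
"""


def items_under_heading(markup, heading_text):
    """Return the li texts of the first ul that follows the matching h2."""
    soup = BeautifulSoup(markup, "html.parser")
    # "name" clashes with the tag-name argument, so pass it through attrs.
    container = soup.find("div", attrs={"name": "content-name"})
    if container is None:
        return []
    # string= matches an h2 whose only content is exactly this text.
    heading = container.find("h2", string=heading_text)
    if heading is None:
        return []
    # Skips whitespace text nodes and returns the next <ul> sibling.
    ul = heading.find_next_sibling("ul")
    if ul is None:
        return []
    return [li.get_text(strip=True) for li in ul.find_all("li")]


for text in items_under_heading(html, "Second Topic"):
    print(text)
```

Returning an empty list when the div, heading, or list is missing keeps the caller from hitting an `AttributeError` on `None`.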
Answer 1: (score: 1)
from bs4 import BeautifulSoup
src = """
<div class="content" name="content-name">
<h2 class="Topic">First Topic</h2>
<ul>
<li>This Data 1</li>
<li>This Data 2</li>
<li>This Data 3</li>
</ul>
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
<li>That Data 2</li>
<li>That Data 3</li>
</ul>
<h2 class="Topic">Third Topic</h2>
<ul>
<li>Their Data 1</li>
<li>Their Data 2</li>
<li>Their Data 3</li>
</ul>
</div>
"""
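# Parse the document (the 'lxml' parser requires the lxml package).
soup = BeautifulSoup(src, 'lxml')
# Narrow the search to the content div.
content = soup.find_all("div", class_="content")[0]
# Locate the heading whose text is exactly "Second Topic".
second_topic = content.find_all("h2", class_="Topic", string="Second Topic")[0]
# The first next_sibling is the newline text node; the second is the <ul>.
ul = second_topic.next_sibling.next_sibling
li = ul.find_all("li")
# Print the text of each list item.
for i in li:
    print(i.string)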
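The doubled `.next_sibling` is needed because the newline between `</h2>` and `<ul>` is itself a node in the parse tree. The short sketch below illustrates this on a reduced snippet; the variable names and the use of `html.parser` are assumptions for the example. It also shows that `find_next_sibling("ul")` reaches the same element without depending on the whitespace.

```python
from bs4 import BeautifulSoup, NavigableString

# Reduced snippet with the same h2 / newline / ul layout as the question.
snippet = """
<div class="content" name="content-name">
<h2 class="Topic">Second Topic</h2>
<ul>
<li>That Data 1</li>
</ul>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
h2 = soup.find("h2", string="Second Topic")

first = h2.next_sibling        # the newline text node after </h2>
second = first.next_sibling    # the <ul> element

print(isinstance(first, NavigableString))  # whitespace-only text node
print(second.name)

# Does not depend on how much whitespace separates the tags.
robust = h2.find_next_sibling("ul")
print(robust == second)
```

If the markup were minified so that `<ul>` directly followed `</h2>`, the doubled `.next_sibling` would step past the list, while `find_next_sibling("ul")` would still find it.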