Web scraping without an ID

Time: 2018-10-20 14:06:05

Tags: python html web-scraping beautifulsoup

I need help with web scraping. Here is a sample of the HTML:

<div class="content" name="content-name">
   <h2 class="Topic">First Topic</h2>
   <ul>
      <li>This Data 1</li>
      <li>This Data 2</li>
      <li>This Data 3</li>
   </ul>
   <h2 class="Topic">Second Topic</h2>
   <ul>
      <li>That Data 1</li>
      <li>That Data 2</li>
      <li>That Data 3</li>
   </ul>
   <h2 class="Topic">Third Topic</h2>
   <ul>
      <li>Their Data 1</li>
      <li>Their Data 2</li>
      <li>Their Data 3</li>
   </ul>
</div>

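Using BeautifulSoup, I can get the div tag with name="content-name". But how do I get the text of all the li tags inside the ul that follows the h2 whose text is "Second Topic"? Everything sits inside the same div, and none of these tags has a specific class, id, or name. Thanks in advance.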

2 answers:

Answer 0: (score: 1)

It is always harder when the tags have no id, no class, and no parent tag of their own to target.

You can use find_previous_sibling, for example to check which h2 comes right before a given ul and keep only the ul that follows "Second Topic".

For the sample HTML this returns

That Data 1
That Data 2
That Data 3
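
One way to read the find_previous_sibling suggestion, as a minimal sketch rather than the answerer's exact code: walk over every ul, look up the nearest h2 sibling before it, and keep the li texts only when that heading matches. The helper name items_under, the trimmed copy of the markup, and the choice of the built-in "html.parser" are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (shortened for this sketch).
html = """
<div class="content" name="content-name">
   <h2 class="Topic">First Topic</h2>
   <ul><li>This Data 1</li><li>This Data 2</li></ul>
   <h2 class="Topic">Second Topic</h2>
   <ul><li>That Data 1</li><li>That Data 2</li><li>That Data 3</li></ul>
   <h2 class="Topic">Third Topic</h2>
   <ul><li>Their Data 1</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def items_under(soup, heading_text):
    """Hypothetical helper: li texts of every ul whose nearest preceding h2 matches."""
    results = []
    for ul in soup.find_all("ul"):
        # Closest h2 that is a sibling of this ul and appears before it.
        heading = ul.find_previous_sibling("h2")
        if heading is not None and heading.get_text(strip=True) == heading_text:
            results.extend(li.get_text(strip=True) for li in ul.find_all("li"))
    return results


second_topic_items = items_under(soup, "Second Topic")
for text in second_topic_items:
    print(text)
```

Because find_previous_sibling("h2") filters by tag name, the whitespace text nodes between the tags do not get in the way.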

Answer 1: (score: 1)

from bs4 import BeautifulSoup

src = """
<div class="content" name="content-name">
    <h2 class="Topic">First Topic</h2>
    <ul>
        <li>This Data 1</li>
        <li>This Data 2</li>
        <li>This Data 3</li>
    </ul>
    <h2 class="Topic">Second Topic</h2>
    <ul>
        <li>That Data 1</li>
        <li>That Data 2</li>
        <li>That Data 3</li>
    </ul>
    <h2 class="Topic">Third Topic</h2>
    <ul>
        <li>Their Data 1</li>
        <li>Their Data 2</li>
        <li>Their Data 3</li>
    </ul>
</div>
"""

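# 'lxml' requires the lxml package; the built-in "html.parser" also works here.
soup = BeautifulSoup(src, 'lxml')

# The container div is the only element with class "content".
content = soup.find("div", class_="content")

# Locate the heading by its exact text.
second_topic = content.find("h2", class_="Topic", string="Second Topic")

# find_next_sibling("ul") skips the whitespace text node between </h2> and <ul>,
# which next_sibling.next_sibling silently depends on.
ul = second_topic.find_next_sibling("ul")

# Print the text of each list item under the heading.
for li in ul.find_all("li"):
    print(li.get_text())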
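
If more than one topic is needed, the same sibling lookup can be applied to every heading at once. The following is a rough sketch, not part of the original answer: the dictionary name topics is made up for illustration, the markup is a shortened copy of the question's, and it assumes each h2 is followed by its own ul (otherwise find_next_sibling would pick up the next topic's list). It also selects the div by its name attribute, as the question describes; that has to go through attrs because name is already the tag-name parameter of find().

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup.
src = """
<div class="content" name="content-name">
    <h2 class="Topic">First Topic</h2>
    <ul><li>This Data 1</li><li>This Data 2</li><li>This Data 3</li></ul>
    <h2 class="Topic">Second Topic</h2>
    <ul><li>That Data 1</li><li>That Data 2</li><li>That Data 3</li></ul>
    <h2 class="Topic">Third Topic</h2>
    <ul><li>Their Data 1</li><li>Their Data 2</li><li>Their Data 3</li></ul>
</div>
"""

soup = BeautifulSoup(src, "html.parser")

# The HTML attribute called "name" must be passed via attrs,
# since find(name=...) would be interpreted as the tag name.
content = soup.find("div", attrs={"name": "content-name"})

# Map each heading text to the li texts of the ul that follows it.
topics = {}
for h2 in content.find_all("h2", class_="Topic"):
    ul = h2.find_next_sibling("ul")
    items = [li.get_text(strip=True) for li in ul.find_all("li")] if ul else []
    topics[h2.get_text(strip=True)] = items

for text in topics["Second Topic"]:
    print(text)
```

Looking up topics["Second Topic"] then gives the list the question asks for, and the other headings are available under their own keys.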