Question

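I am trying to extract the content of an HTML element that comes right after an element with the specific content "ID".

For example, in the data-tip attribute shown below, I want to extract the content 1886G from the element that follows the ID header, and I want this to work in every case.

I am parsing with beautifulsoup4 in Python in two passes: one to find the elements by id, and another to parse the data-tip content string back into HTML. I tried to use findNextSibling() to get the ID, as follows: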

import os
import re
from bs4 import BeautifulSoup


html_file = BeautifulSoup(open("data_sample.html"), "html.parser")

for tag in html_file.findAll(id = re.compile("^content.*")):
    dataTip = BeautifulSoup(tag["data-tip"], "html.parser")
    print("find ID:")
    print(dataTip.findNextSibling("tr", attrs = {"th" : "ID"}))

Output

find ID:
None

Here is a sample element:

<div id="content_placement_o_89879879789" style="z-index: 77; position: absolute; width: 25px; height: 43px; left: 124.0px; top: 344.0px;" data-tip="<table width='200'>
<tr>
<th>Name</th>
<td>Generic Phone Name</td>
</tr>
<tr>
<th>ID</th>
<td>1886G</td>
</tr>
<tr>
<th>Status</th>
<td>Same</td>
</tr>
</table>
">
<img alt="Image" class="same_mark_10987024  same_mark_highlighted" height="43" id="s_o_848483938748" src="https://website/picture.gif" style="position: absolute" width="25">
</div>

Clearly, I am missing something about how this function works. Does anyone know what I can change to accomplish this?

Answer 1

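You need to call findNextSibling on the th tag whose text is ID, not on tr. The tr has a parent-child relationship with the tags you are looking for. More explicitly, th and td are children of tr, and th and td are siblings of each other:

import re

for tag in html_file.findAll(id = re.compile("^content.*")):
    dataTip = BeautifulSoup(tag["data-tip"], "html.parser")
    # Find the <th> whose text is "ID", then step to the sibling that follows it
    id = dataTip.find("th", text = "ID").findNextSibling().text
    print(id)
    # 1886G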
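The same idea can be sketched as a self-contained script. This is a minimal sketch and not the only way to do it. The names `SAMPLE` and `extract_ids` are made up for illustration, and the sample markup is a shortened version of the element from the question. It uses the newer spellings `find_all`, `find_next_sibling`, and `string=`, which beautifulsoup4 provides alongside the older `findAll`, `findNextSibling`, and `text=`. Unlike the snippet above, it skips elements that have no ID row instead of raising an error.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: a shortened copy of the element from the question.
SAMPLE = """
<div id="content_placement_o_89879879789" data-tip="<table width='200'>
<tr>
<th>Name</th>
<td>Generic Phone Name</td>
</tr>
<tr>
<th>ID</th>
<td>1886G</td>
</tr>
</table>
">
<img alt="Image" src="https://website/picture.gif">
</div>
"""


def extract_ids(html):
    """Return the text of the <td> that follows each <th>ID</th> in the data-tip tables."""
    soup = BeautifulSoup(html, "html.parser")
    ids = []
    # First pass: find every element whose id starts with "content".
    for tag in soup.find_all(id=re.compile("^content")):
        # Second pass: parse the data-tip string as HTML in its own right.
        tip = BeautifulSoup(tag.get("data-tip", ""), "html.parser")
        header = tip.find("th", string="ID")
        if header is None:
            continue  # this tooltip has no ID row
        # <th> and <td> are siblings inside the same <tr>.
        cell = header.find_next_sibling("td")
        if cell is not None:
            ids.append(cell.get_text(strip=True))
    return ids


print(extract_ids(SAMPLE))
```

Passing "td" to `find_next_sibling` makes the intent explicit and keeps the lookup from landing on any other tag that might follow the header. The call in the question returns None for two reasons: it is made on the top-level parsed object, which has no siblings, and `attrs = {"th" : "ID"}` searches for an attribute named th rather than for a th tag with the text ID.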
