Question

I'm trying to collect the content between two tags at the same level; in this case, the content between the two h2 tags below:

<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>

Ideally, I'd like the output to look like this (ideally the text in the <th> would be ignored, but I'm fine with it staying):

Plan for and be active in your own learning...
Reflect on your knowledge of teaching and yourself...
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience

This is what I have so far:

from bs4 import BeautifulSoup

# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(text, "html.parser")
output = ""
unitLO = soup.find(id="learning-outcomes")
if unitLO:
    # read the tag name only after confirming that a match was found
    tagBreak = unitLO.name
    # we will loop until we hit the next tag with the same name as the
    # matched tag, e.g. if unitLO matches an H3, then all content up to the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        output += str(tag)

print(output)

It gives the following output, which is a string:

>>> type(output)
<class 'str'>
>>>


<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>

This is somewhat different from what I want...

The only solution I've come up with is to push output through another round of BeautifulSoup parsing:

>>> moresoup = BeautifulSoup(output, "html.parser")
>>> for string in moresoup.strings:
...     print(string)
...






On successful completion of this unit, you will beable to:












Plan for and be active in your own learning...


Reflect on your knowledge of yourself....


Articulate your informed understanding of the foundations...


Demonstrate information literacy skills


Communicate in writing for an academic audience










>>>

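That's really inelegant and leaves a lot of whitespace (which is easy enough to clean up, of course).

Any ideas for a more elegant way?

Many thanks!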

Answer 1

Try using soup.find_all to get all the p tags.

Example:

from bs4 import BeautifulSoup
s = """<h2 id="learning-outcomes">Learning Outcomes</h2>
<table>
<thead>
<tr class="header">
<th>On successful completion of this unit, you will beable to:</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><ol type="1">
<li><p>Plan for and be active in your own learning...</p></li>
<li><p>Reflect on your knowledge of yourself....</p></li>
<li><p>Articulate your informed understanding of the foundations...</p></li>
<li><p>Demonstrate information literacy skills</p></li>
<li><p>Communicate in writing for an academic audience</p></li>
</ol></td>
</tr>
</tbody>
</table>
<h2 id="prior-knowledge">Prior knowledge</h2>"""

soup = BeautifulSoup(s, "html.parser")
for p in soup.find(id="learning-outcomes").find_next("table").find_all("p"):
    print(p.text)

Output:

Plan for and be active in your own learning...
Reflect on your knowledge of yourself....
Articulate your informed understanding of the foundations...
Demonstrate information literacy skills
Communicate in writing for an academic audience
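
The lookup above assumes that a table directly follows the heading. If the section might hold other elements as well, a rough sketch like the one below walks the siblings up to the next heading of the same name and collects the paragraph text from each one. The helper name paragraphs_between and the small sample markup are made up for illustration, and the sketch assumes that every item of interest is wrapped in a <p>.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def paragraphs_between(soup, start_id):
    """Collect the text of every <p> found between the element with the
    given id and the next sibling tag that has the same name."""
    start = soup.find(id=start_id)
    if start is None:
        return []
    lines = []
    for sibling in start.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip the strings (mostly whitespace) between tags
        if sibling.name == start.name:
            break  # reached the next heading at the same level
        if sibling.name == "p":
            lines.append(sibling.get_text(strip=True))
        # <p> elements nested anywhere inside this sibling
        lines.extend(p.get_text(strip=True) for p in sibling.find_all("p"))
    return lines


# Hypothetical sample, shaped like the markup in the question
sample = """<h2 id="first">First</h2>
<table><tr><th>Header text</th></tr>
<tr><td><ol><li><p>Item one</p></li><li><p>Item two</p></li></ol></td></tr></table>
<h2 id="second">Second</h2>"""

for line in paragraphs_between(BeautifulSoup(sample, "html.parser"), "first"):
    print(line)
```

Because only <p> elements are read, the <th> text is left out, and no second parsing pass is needed.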

Answer 2

Change the code as follows:

if unitLO:
    # we will loop until we hit the next tag with the same name as the
    # matched tag. eg if unitLO matches an H3, then all content up till the
    # next H3 is captured.
    for tag in unitLO.next_siblings:
        if tag.name == tagBreak:
            break
        else:
            if str(tag).strip() != "":
                output += str(tag)

print(output)
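
The check above drops the whitespace-only strings between tags, but output is still markup. If plain text is the goal, one possible variation on the same loop reads stripped_strings from each sibling instead of serialising it with str(). This is only a sketch: the function name text_between and the sample markup are hypothetical, and unlike the ideal output in the question it keeps the <th> text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_between(start):
    """Return the non-empty text fragments found between `start` and the
    next sibling tag with the same name."""
    lines = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            if node.name == start.name:
                break  # next tag at the same level: stop collecting
            # stripped_strings yields each text fragment with the
            # surrounding whitespace removed and skips empty ones
            lines.extend(node.stripped_strings)
        else:
            # a string sitting directly between tags; note that an HTML
            # comment at this level would be picked up here as well
            text = node.strip()
            if text:
                lines.append(text)
    return lines


# Hypothetical sample, shaped like the markup in the question
sample = """<h2 id="first">First</h2>
<table><tr><th>Header text</th></tr>
<tr><td><ol><li><p>Item one</p></li><li><p>Item two</p></li></ol></td></tr></table>
<h2 id="second">Second</h2>"""

soup = BeautifulSoup(sample, "html.parser")
print("\n".join(text_between(soup.find(id="first"))))
```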

Extracting the content between two tags at the same sibling level