Question

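I am trying to parse a web page to get the text that appears below a heading or a piece of bold text. For a page containing a fragment like the one below, I want the text that comes after the bold tag but before the h3 tag.

This is different from the tag's inner text. I do not want the string "Name of instructor"; I want the professor's details: name, designation, and office hours.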

    ....
    <bold>Name of instructor</bold>:
    Dr. A. B. C<br />
    Professor, Dept. of Alphabet<br />
    Office hours: M, T 8:00am-10:00am<br />

    <h3>Course Name</h3>:
    Introduction to Alphabet

    <h4>Course timings</h4>
    Monday 4:00-6:00 pm
    Tuesday 5:00-6:00 pm
    ....

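I am using BeautifulSoup to parse the page. I tried .next_sibling, but that works for tags with the same name, such as bold to bold or h3 to h3. .next gives the next element rather than the next tag, and that element can be a br or a p.

Please let me know if there is anything I can add to clarify.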

Answer 1

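I am using BS3 (BeautifulSoup 3). This code iterates along nextSibling until it reaches a tag that is not self-closing (self-closing tags such as <br /> are passed over), collecting every NavigableString it finds along the way.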

    # BeautifulSoup 3 (Python 2 only)
    from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString

    txt = \
    '''
    <bold>Name of instructor</bold>:
    Dr. A. B. C<br />
    Professor, Dept. of Alphabet<br />
    Office hours: M, T 8:00am-10:00am<br />

    <h3>Course Name</h3>:
    Introduction to Alphabet

    <h4>Course timings</h4>
    Monday 4:00-6:00 pm
    Tuesday 5:00-6:00 pm

    '''

    # Tell the XML-style parser that <br /> is self-closing
    pool = BeautifulStoneSoup(txt, selfClosingTags=['br'])
    found_txt = []
    # Walk the siblings that follow <bold>
    for x in pool.find("bold").nextSiblingGenerator():
        if isinstance(x, Tag) and not x.isSelfClosing:
            # First real tag (here <h3>) ends the section
            break
        elif isinstance(x, NavigableString):
            # Plain text between tags: keep it
            found_txt.append(x)
    print found_txt

Given that you have the complete HTML (I presume), you do not need BeautifulStoneSoup; the regular BeautifulSoup class is enough.
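
For readers on a current setup, the same idea can be sketched with BeautifulSoup 4 (the bs4 package) on Python 3. This is only a rough equivalent under a few assumptions: the helper name text_after and its skip parameter are made up for illustration, the built-in "html.parser" backend is used, and stripping the colon left over after the closing bold tag is a cosmetic choice rather than part of the original answer.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

html = """
<bold>Name of instructor</bold>:
Dr. A. B. C<br />
Professor, Dept. of Alphabet<br />
Office hours: M, T 8:00am-10:00am<br />

<h3>Course Name</h3>:
Introduction to Alphabet

<h4>Course timings</h4>
Monday 4:00-6:00 pm
Tuesday 5:00-6:00 pm
"""


def text_after(tag, skip=("br",)):
    """Hypothetical helper: collect the text that follows `tag`,
    stopping at the first sibling tag whose name is not in `skip`."""
    found = []
    for node in tag.next_siblings:
        if isinstance(node, Tag):
            if node.name in skip:
                continue  # line breaks do not end the section
            break  # any other tag (h3, h4, ...) ends it
        if isinstance(node, NavigableString):
            # Drop surrounding whitespace and the colon after the label
            text = node.strip().lstrip(":").strip()
            if text:
                found.append(text)
    return found


soup = BeautifulSoup(html, "html.parser")
details = text_after(soup.find("bold"))
print(details)
```

Calling text_after(soup.find("h3")) in the same way should return the course name, since iteration stops at the h4 tag. Note that .next_siblings yields text nodes as well as tags, which is why the loop has to check the type of each node.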
