Question

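Below is an HTML example, but my use case involves different kinds of unstructured text. What is a generic approach to associate (label) each of the two text paragraphs below with its parent heading (SUMMARY1)? The heading here is not actually a heading tag; it is just bold text. I am trying to extract text paragraphs and identify their corresponding heading sections, regardless of whether the heading is a standard heading tag or something like the following: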

<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location>SUMMARY1</b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location>Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>

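I would like to end up with JSON like this: {SUMMARY1: ['This is a region in Europe where the climate is good', 'Total Europe population estimate was used back then']}

Please advise. Thanks.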

Answer 1

With BeautifulSoup it should look something like this:

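from bs4 import BeautifulSoup

html = 'your html'  # placeholder: put the HTML document here
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
header = soup.find('b')  # the bold text that acts as the heading
print(header.text)
first_paragraph = header.find_next('p')  # find_next is the current name for findNext
print(first_paragraph.text)
second_paragraph = first_paragraph.find_next('p')
print(second_paragraph.text)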
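
Since the question says the heading may be either a standard heading tag or just bold text, the lookup above can be widened a little. The following is only a sketch under assumptions: the list HEADING_LIKE and the helper name first_heading_and_paragraphs are hypothetical, and it treats every <p> after the heading as belonging to it.

```python
from bs4 import BeautifulSoup

# Assumption: these tag names are what counts as "heading-like" in the input.
HEADING_LIKE = ["h1", "h2", "h3", "h4", "h5", "h6", "b", "strong"]


def first_heading_and_paragraphs(html):
    """Return (heading text, list of paragraph texts) for the first heading-like tag."""
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list of names makes find() return the first match in document order.
    header = soup.find(HEADING_LIKE)
    if header is None:
        # No heading-like element at all: nothing to label.
        return None, []
    paragraphs = [p.get_text(" ", strip=True) for p in header.find_all_next("p")]
    return header.get_text(strip=True), paragraphs


print(first_heading_and_paragraphs("<p><b>SUMMARY1</b></p><p>Some text.</p>"))
```

Returning (None, []) when nothing matches avoids the AttributeError that header.text would raise on a document without any bold text.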

Answer 2

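I initially considered using the newspaper module, but I could not find a way to get SUMMARY1 as the only part of the "summary" or "description", or anywhere else on the resulting Article object. In any case, take a look at this module; it may really help you parse HTML articles.

With BeautifulSoup, however, you can first locate the heading and then use find_all_next() to get the p elements that follow it: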

from bs4 import BeautifulSoup


html = """
<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>"""

soup = BeautifulSoup(html, "lxml")
header = soup.find("b")
parts = [p.get_text(strip=True, separator=" ") for p in header.find_all_next("p")]
print({header.get_text(strip=True): parts})

Prints:

{'SUMMARY1': [
     'This is a region in Europe where the climate is good.', 
     'Total Europe population estimate was used back then.']}
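
Note that find_all_next("p") collects every later paragraph, so with more than one bold heading in the document the first heading would also pick up the paragraphs of the sections after it. A minimal sketch of one way to handle that, assuming a paragraph counts as a heading when all of its text sits inside a <b> element; the function name group_by_bold_headings and the sample markup are hypothetical:

```python
from bs4 import BeautifulSoup


def group_by_bold_headings(html):
    """Map each bold-only paragraph to the plain paragraphs that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # text of the most recent heading seen so far
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        bold = p.find("b")
        # Assumption: a paragraph is a heading when its whole text is bold.
        if bold is not None and bold.get_text(" ", strip=True) == text:
            current = text
            sections[current] = []
        elif current is not None and text:
            sections[current].append(text)
    return sections


sample = """
<p><b>SUMMARY1</b></p>
<p>This is a region in <i>Europe</i> where the climate is good.</p>
<p>Total Europe population estimate was used back then.</p>
<p><b>SUMMARY2</b></p>
<p>Another paragraph.</p>
"""
print(group_by_bold_headings(sample))
```

Paragraphs that appear before the first heading are skipped here; whether to keep them under some default key depends on the documents being processed.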

Answer 3

You can also do it like this:

from bs4 import BeautifulSoup

content = """
<html>
    <div>
        <p>
            <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
        </p>
    </div>
</html>
"""
soup = BeautifulSoup(content, "lxml")
items = soup.select("b")[0]
paragraphs = ' '.join([' '.join(data.text.split()) for data in items.find_parent().find_next_siblings()])
print({items.text : paragraphs})

Output:

{'SUMMARY1': 'This is a region in Europe where the climate is good. Total Europe population estimate was used back then.'}
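
The output above joins both paragraphs into one string, while the question asks for a list per heading in JSON. A small variation on the same sibling-based idea, offered as a sketch: it assumes the heading's <b> sits inside a <p> and that only sibling <p> elements belong to it, and the simplified sample markup is made up for illustration.

```python
import json

from bs4 import BeautifulSoup

content = """
<div>
    <p><b>SUMMARY1</b></p>
    <p>This is a region in Europe
       where the climate is good.</p>
    <p>Total Europe population estimate was used back then.</p>
    <div class="aspNetHidden"></div>
</div>
"""

soup = BeautifulSoup(content, "html.parser")
heading = soup.select_one("b")
# Restrict to sibling <p> tags so unrelated elements such as the hidden div are ignored.
siblings = heading.find_parent("p").find_next_siblings("p")
# split() + join() collapses the line breaks and indentation inside each paragraph.
paragraphs = [" ".join(p.get_text().split()) for p in siblings]
result = json.dumps({heading.get_text(strip=True): paragraphs})
print(result)
```

Keeping the paragraphs as separate list items preserves the boundaries between them, which the joined string loses.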

How to identify HTML text with its corresponding heading object?

3 answers: