Question

I may have a document containing the following information:

<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>

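This block will basically repeat x times, and I need to extract this information. However, the number of <p> tags varies, so there could be as many as 7 separate ones before the next h1 tag appears. I am using BeautifulSoup for this.

I can extract the data, but I can't work out a rule that says: for each h1 tag, extract however many tags follow until the next h1 tag.

So each time an h1 tag occurs, a new record starts.

Hope this makes sense, thanks!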

Answer 1

What data structure do you want to store this in?

You can use Python's str.split() method and split on "<h1>", which gives you something like the following:

text = """<h1>Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>
       <h1>Some More Text</h1>
       <p>Grab this</p>"""

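# The text starts with "<h1>", so split() yields an empty first element;
# drop empty pieces and trim the whitespace around each chunk.
textChunks = [chunk.strip() for chunk in text.split("<h1>") if chunk.strip()]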

textChunks will then look like

["""Some Text</h1>
       <p>A person name</p>
       <p><i>Works somewhere, in some country</i></p>
       <p>Grab this text as well</p>""",
 """Some More Text</h1>
       <p>Grab this</p>"""]

You can then process each chunk separately, either by iterating over the list or by passing each chunk to BeautifulSoup.
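Since the question already uses BeautifulSoup, the grouping can also be done on the parsed tree instead of on the raw string. The following is only a minimal sketch, assuming the h1 and p tags are siblings at the same nesting level, as in the sample; the names `html` and `records` and the dictionary keys are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<h1>Some Text</h1>
<p>A person name</p>
<p><i>Works somewhere, in some country</i></p>
<p>Grab this text as well</p>
<h1>Some More Text</h1>
<p>Grab this</p>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for h1 in soup.find_all("h1"):
    paragraphs = []
    # Walk the h1 and p tags that follow this heading at the same level.
    for sibling in h1.find_next_siblings(["h1", "p"]):
        if sibling.name == "h1":
            break  # the next heading starts a new record
        paragraphs.append(sibling.get_text(strip=True))
    # Illustrative record layout: one dict per h1 block.
    records.append({"title": h1.get_text(strip=True), "paragraphs": paragraphs})

for record in records:
    print(record)
```

Each record holds however many paragraphs appear before the next h1, so it does not matter whether a block has one <p> tag or seven. If the real document wraps the blocks in other elements, the sibling walk would need adjusting.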
