Question

I have 30911 html files in a folder. For each file I need to (1) check whether it contains the tag:

<strong>123</strong>

and (2) extract the content that follows it, up to the end of that section.

The problem I found is that in some files the section ends before

<strong>567</strong>

while other files have no such tag, and the section ends before

<strong>89</strong> or other tags (which I do not know, because I can't check 30K+ files)

The p elements also have a different "p p_number" in each file, and sometimes they have no id.

I started by searching with BeautifulSoup, but I don't know how to do the next step, extracting the content:

import re
import bs4

# fo is an open file object for one of the html files
soup = bs4.BeautifulSoup(fo, "lxml")
m = soup.find("strong", string=re.compile("123"))

Also, is it possible to save the content in txt format so that it still looks like it does in the html, one line per paragraph?

line 1
line 2
...
line 50

If I use p.get_text(strip=True), everything runs together:

line1 content line2 content ... 
line50 content....

Answer 1

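If I understand correctly, you can first locate the starting point: the p element that contains a strong element with the "Question-and-Answer Session" text. Then you can iterate over the next siblings of that p element until you hit one whose strong element has the "Copyright policy" text.

Complete reproducible example:

import re
from bs4 import BeautifulSoup

data = """
<body>
    <p><strong>Question-and-Answer Session</strong></p>
    <p>Hi John and Greg, good afternoon. contents....</p>
    <p><strong>Copyright policy: </strong>other content about the policy....</p>
</body>
"""

soup = BeautifulSoup(data, "html.parser")

def find_question_answer(tag):
    # the starting point: a p element whose strong child holds the section title
    return tag.name == 'p' and tag.find("strong", string=re.compile(r"Question-and-Answer Session"))

question_answer = soup.find(find_question_answer)
for p in question_answer.find_next_siblings("p"):
    # stop at the paragraph that opens the next section
    if p.find("strong", string=re.compile(r"Copyright policy")):
        break
    print(p.get_text(strip=True))

Prints:

Hi John and Greg, good afternoon. contents....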
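To apply the same sibling-walking idea to a whole folder and keep one paragraph per line in the txt output, something along these lines might work. This is only a sketch under assumptions. The names extract_section and process_folder are made up for illustration. It assumes the files end in .html and read as UTF-8, that the section title sits in a strong inside a p, and that the section's paragraphs are siblings of that p. The END pattern lists only the one end marker known so far. If no end marker is found, the loop simply runs to the last sibling p, which may or may not be what you want for the files that end on some other heading.

```python
import re
from pathlib import Path

from bs4 import BeautifulSoup

START = re.compile(r"Question-and-Answer Session")
# Assumed end marker; extend the pattern as more section headings turn up.
END = re.compile(r"Copyright policy")


def extract_section(html, start=START, end=END):
    """Return the section's paragraphs as a list, or None if the start heading is missing."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(
        lambda tag: tag.name == "p" and tag.find("strong", string=start)
    )
    if heading is None:
        return None
    lines = []
    for p in heading.find_next_siblings("p"):
        if p.find("strong", string=end):
            break  # reached the next known section
        # join the text pieces inside one paragraph with a space
        lines.append(p.get_text(" ", strip=True))
    return lines


def process_folder(src_dir, out_dir):
    """Write one .txt per matching .html file, one paragraph per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(Path(src_dir).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="replace")
        lines = extract_section(html)
        if lines is None:
            continue  # this file does not contain the start heading
        target = out / (path.stem + ".txt")
        target.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Collecting the paragraphs in a list and joining them with "\n" is what keeps them on separate lines. get_text(strip=True) only controls the text inside a single p, so the running together comes from how the results are concatenated afterwards.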

