Question

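I have more than 1,000 HTML files, and I want to extract the "ITEM 1A. RISK FACTORS" section from each of them. None of the files have IDs or anything similar, and most of them are formatted differently: some have the text in "div" tags, while others have the content in "p", "table", and so on.

Given one specific format, I can extract the text. For example, for the file here, I was able to extract the text of the ITEM 1A. RISK FACTORS section using this code.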

should_print = False

for item in soup.find_all("div"):
    if (item.name == "div" and item.parent.name != "div"):
        if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
            should_print = True
        elif "ITEM" in item.text and "1B" in item.text:
            break
        if should_print:
            with open(r"RF.html", "a") as f:
                f.write(str(item))

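I could write code for each format, but how would I determine which code to run on which file? For example, if I run the code above on a file whose text is in "p" tags, it gives me garbage text.

Here and here are more examples of the HTML files.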

Answer 1

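You only need to change the if condition: you switch should_print from False to True, but the item in the loop still refers to the elements returned by soup.find_all("div").

Change the condition to: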

if "ITEM" in item.text and "1A" in item.text and "RISK" in item.text and "FACTORS" in item.text:
    print(item.find('b').text)

Output:

ITEM 1A. RISK FACTORS.

Inside the if statement:

print(item.text) will show all the text

print(item) will show all the source that contains the strings ITEM, 1A, RISK
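The keyword test above still only sees "div" elements, so it does not help with files that keep the text in "p" or "table". As a rough sketch of a tag-independent variant, the same start and stop test can be applied to the flattened text of the page. This swaps in the standard library's html.parser in place of BeautifulSoup and is not the answerer's method. The names TextCollector and extract_risk_factors, the case-insensitive matching, and the two sample strings are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect non-empty text chunks from any tag, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Normalise runs of whitespace inside each chunk.
            self.chunks.append(" ".join(data.split()))


# Assumption: headings look like the ones quoted in the question.
START = re.compile(r"ITEM\s+1A\b.*RISK\s+FACTORS", re.IGNORECASE)
END = re.compile(r"ITEM\s+1B\b", re.IGNORECASE)


def extract_risk_factors(html):
    """Return the text from the first ITEM 1A heading up to the next ITEM 1B."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    section, inside = [], False
    for chunk in parser.chunks:
        if not inside and START.search(chunk):
            inside = True
        elif inside and END.search(chunk):
            break
        if inside:
            section.append(chunk)
    return "\n".join(section)


# Hypothetical samples: the same content laid out with different tags.
sample_div = ("<div><b>ITEM 1A. RISK FACTORS.</b></div>"
              "<div>Risk text.</div><div>ITEM 1B. UNRESOLVED</div>")
sample_p = ("<p>Item 1A. Risk Factors</p>"
            "<table><tr><td>Risk text.</td></tr></table>"
            "<p>Item 1B. Unresolved</p>")

print(extract_risk_factors(sample_div))
print(extract_risk_factors(sample_p))
```

This shares the limits of the original loop: a table of contents that lists "Item 1A" and "Item 1B" would match first, and a heading split across several tags would not match at all, so real filings are likely to need extra checks.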

Answer 2

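A good option is to use XPath to find the section heading, which can give a generic solution. The example below uses xmllint in bash, but the same should be possible in Python with xml.etree.ElementTree, although it supports only a limited XPath subset, so a library with fuller XPath support such as lxml may be needed for these exact expressions.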

xmllint -html -recover -xpath '//div[descendant-or-self::*[.="ITEM 1A. RISK FACTORS."]]/descendant-or-self::text()' 2>/dev/null 10k.htm

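XPath explanation:

//div[descendant-or-self::... selects the divs that have a descendant matching the expression (explained below).
descendant-or-self::*[.="ITEM 1A. RISK FACTORS."] finds any node whose string value is exactly the expected heading.
descendant-or-self::text() gets the text of all the contained elements.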

XPath that uses contains(...) to detect the heading:

'//div[descendant-or-self::text()[contains(.,"ITEM 1A. RISK FACTORS")]]/descendant-or-self::text()'
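For the Python side, here is a minimal sketch of the same idea with xml.etree.ElementTree. ElementTree's XPath subset has no descendant-or-self axis and no contains() function, so the sketch emulates the expression with iter() and itertext() instead of passing the XPath string directly. It assumes the markup is well-formed, which real filings often are not, so a lenient HTML parser may be needed in practice. SAMPLE and divs_containing are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

HEADING = "ITEM 1A. RISK FACTORS"  # heading text taken from the XPath above

# Tiny well-formed stand-in for a filing (an assumption; real files are messier).
SAMPLE = """<html><body>
<div><p><b>ITEM 1A. RISK FACTORS.</b></p><p>Our business faces risks.</p></div>
<div><p><b>ITEM 1B. UNRESOLVED STAFF COMMENTS.</b></p><p>None.</p></div>
</body></html>"""


def divs_containing(root, heading):
    """Return the joined text of every div that has a text piece containing heading."""
    results = []
    for div in root.iter("div"):
        # itertext() yields the text inside the div and all of its descendants.
        pieces = list(div.itertext())
        if any(heading in piece for piece in pieces):
            results.append("".join(pieces))
    return results


root = ET.fromstring(SAMPLE)
sections = divs_containing(root, HEADING)
print(sections[0])
```

Like the XPath version, this returns every div that contains the heading, including any outer div that wraps it, so the caller may still need to pick the innermost or the longest match.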
