Question

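I know there are several answers to a similar question, but I could not find one that fits my case.

I have thousands of HTML files and need to extract one section from each of them, heading and body included: "Item 1A. Risk Factors".

Here is the link to the HTML file. I want to extract the text starting on page 25, i.e. Item 1A. Risk Factors, through page 37, where this section ends.

I can extract it either as HTML or as plain text; either is fine.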

What I am looking for

This is something. Sorry for the Google Drive link; I could not find another way to share it.

Answer 1

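There are many ways to do this. BeautifulSoup and requests will make your life easier. I'm sure there are more optimal solutions, but here is a simple one that demonstrates how to achieve this.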

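#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup

URL = 'https://www.sec.gov/Archives/edgar/data/4904/000000490412000013/ye11aep10k.htm'

# sec.gov may reject requests without a descriptive User-Agent; this value is a placeholder
res = requests.get(URL, headers={'User-Agent': 'Your Name your.email@example.com'})
res.raise_for_status()

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(res.text, 'html.parser')
html = str(soup)

# First and last page numbers of the section
wanted_pages = ['25', '37']

# Page-number markers in this filing are <div id="PN"> elements wrapping a <font> tag
page_divs = [div for div in soup.find_all('div', id='PN')
             if div.font is not None and div.font.text.strip() in wanted_pages]

# Character offsets of the two markers in the serialized document
wanted_page_indices = [html.find(str(div)) for div in page_divs]

section_str = html[slice(*wanted_page_indices)]

section_html = BeautifulSoup(section_str, 'html.parser').prettify()

# do something with the section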
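
Since the question says plain text is as acceptable as HTML, here is a minimal sketch of turning an extracted fragment into text. The helper name section_to_text and the sample fragment are made up for illustration; only BeautifulSoup's get_text is used.

```python
from bs4 import BeautifulSoup


def section_to_text(section_str):
    """Return the visible text of an HTML fragment, one non-empty line per text node."""
    fragment = BeautifulSoup(section_str, 'html.parser')
    # get_text('\n') joins the text nodes with newlines; blank lines are dropped below
    lines = (line.strip() for line in fragment.get_text('\n').splitlines())
    return '\n'.join(line for line in lines if line)


# Hypothetical fragment standing in for the extracted section
sample = '<div><b>Item 1A. Risk Factors</b></div><p>General risks.</p>'
print(section_to_text(sample))
```

In the script above, the same helper could be called on section_str instead of prettifying it.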

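You can install the third-party libraries with pip. As I understand your question, you can put something like this inside a for loop and run it on every HTML page. I hope I understood your question correctly and that this helps.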
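
The loop over many files might look roughly like the sketch below. It assumes the files are already saved locally and that every file marks page numbers with <div id="PN"> elements the way the filing above does; the function names, the folder argument and the *.htm* pattern are illustrative choices, not something taken from the question.

```python
from pathlib import Path

from bs4 import BeautifulSoup


def extract_between_pages(html, first_page, last_page):
    """Return the HTML between the markers for first_page and last_page, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    serialized = str(soup)
    offsets = {}
    for div in soup.find_all('div', id='PN'):
        # Assumption: the marker's whole text is the page number
        label = div.get_text(strip=True)
        if label in (first_page, last_page) and label not in offsets:
            offsets[label] = serialized.find(str(div))
    if first_page not in offsets or last_page not in offsets:
        return None  # this file has no such page markers
    return serialized[offsets[first_page]:offsets[last_page]]


def extract_from_folder(folder, first_page, last_page):
    """Map each .htm/.html file name in folder to its extracted section (or None)."""
    results = {}
    for path in sorted(Path(folder).glob('*.htm*')):
        html = path.read_text(encoding='utf-8', errors='replace')
        results[path.name] = extract_between_pages(html, first_page, last_page)
    return results
```

Pages 25 and 37 are specific to the one filing linked in the question, so the page range would have to be determined separately for each file; files where the markers are missing come back as None and can be inspected by hand.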

Python: I want to extract a section of text from an HTML file
