Question

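I am trying to extract a specific section from an HTML file. Specifically, I am looking for the "ITEM 1" section of a 10-K filing (a US business report filed by a company). For example: https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002

Problem: I cannot find the "ITEM 1" section, and I have no way to tell my algorithm to search from that point ("ITEM 1") to another point (e.g. "ITEM 1A") and extract the text in between.

I would greatly appreciate your help.

Following other examples, I have tried this (and similar variations), but my bd is always empty: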

    try:
        # bd = soup.body.findAll(text=re.compile('^ITEM 1$'))
        # bd = soup.find_all(name="ITEM 1")
        # bd = soup.find_all(["ITEM 1", "ITEM1", "Item 1", "Item1", "item 1", "item1"])

        print(" Business Section (Item 1): ", bd.content)

    except:
        print("\n Section not found!")

I am using Python 3.7 and BeautifulSoup4.

Regards

Answer 1

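As I mentioned in the comments, because of the nature of EDGAR, this may work for one filing but not for another. The principles should generally hold, though (with some adjustments...).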

    import requests
    import lxml.html

    url = 'https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002'
    # Note: sec.gov may reject requests that lack a descriptive User-Agent header.
    source = requests.get(url)
    doc = lxml.html.fromstring(source.text)

    # In this filing, Item 1 sits in a series of <p> tags that follow a table
    # containing an <a> tag whose "name" attribute is "a_002".
    tabs = doc.xpath('//table[./tr/td/font/a[@name="a_002"]]/following-sibling::p/font')

    for i in tabs:
        if i.text is not None:
            # Print the text of this <font> element, then move on to the next one.
            print(i.text_content().strip().replace('\n', ''))
        nxt = i.getparent().getnext()
        # Stop once the <p> tags of Item 1 end: the next Item begins with a table
        # holding an <a> tag whose "name" attribute is "a_003".
        if nxt is not None and nxt.tag == 'table':
            if any(j.tag == 'a' and j.get('name') == 'a_003' for j in nxt.iterdescendants()):
                break

The output is the text of Item 1 in this filing.
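Since the anchor names (`a_002`, `a_003`) are specific to this one filing, a rougher but more portable variant of the same idea is to flatten the HTML to text and cut between the two headings with regular expressions. The following is only a sketch under stated assumptions: `html_to_text`, `extract_section`, `SAMPLE_HTML` and both heading patterns are made up for illustration, the standard-library `html.parser` stands in for lxml so the example runs without a network call, and picking the longest candidate span is a heuristic to skip the table of contents, not a guarantee.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document and ignore the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &nbsp; into "\xa0"
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Flatten HTML to one line of text with single spaces."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    # Joining with spaces can split a word that is broken up by inline tags.
    # On str input, \s also matches the non-breaking space (\xa0).
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


# Assumed heading shapes: "ITEM 1." / "ITEM 1:" and "ITEM 1A." / "ITEM 1A:".
START_RE = re.compile(r"\bITEM\s+1\s*[.:]", re.IGNORECASE)
END_RE = re.compile(r"\bITEM\s+1A\s*[.:]", re.IGNORECASE)


def extract_section(text, start_re=START_RE, end_re=END_RE):
    """Return the longest span that runs from a start heading to the next end heading."""
    best = None
    for start in start_re.finditer(text):
        end = end_re.search(text, start.end())
        if end is None:
            continue
        candidate = text[start.start():end.start()].strip()
        # The table of contents yields a short span, the real section a long one.
        if best is None or len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical miniature filing: a table of contents followed by the sections.
SAMPLE_HTML = """
<table><tr><td>Item&nbsp;1.</td><td>Business</td><td>3</td></tr>
<tr><td>Item 1A.</td><td>Risk Factors</td><td>10</td></tr></table>
<p><b>ITEM&nbsp;1. BUSINESS</b></p>
<p>We make widgets.</p>
<p>We sell them worldwide.</p>
<p><b>ITEM 1A. RISK FACTORS</b></p>
<p>Widgets may break.</p>
"""

section = extract_section(html_to_text(SAMPLE_HTML))
print(section)
```

For a real filing you would pass the downloaded page to `html_to_text` instead of `SAMPLE_HTML`. Some filings have no "ITEM 1A" heading at all, in which case the end pattern would need to target the next heading that does exist (for example "ITEM 2").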

Answer 2

The text contains special characters. Remove them first.

    import requests
    from simplified_scrapy.simplified_doc import SimplifiedDoc

    html = requests.get('https://www.sec.gov/Archives/edgar/data/1591890/000149315218003887/form10-k.htm#a_002').text
    doc = SimplifiedDoc(html)
    # Collapse any run of whitespace between "ITEM" and "1" into a single space,
    # keeping the digit so that "ITEM 1" stays distinct from "ITEM 11".
    doc.loadHtml(doc.replaceReg(doc.html, r'ITEM\s+1', 'ITEM 1'))
    item1 = doc.getElementByText('ITEM 1')
    print(item1)  # the matched element, shown as a dict with its tag and inner HTML

    # Here's what you might use
    table = item1.getParent('TABLE')
    trs = table.TRs
    for tr in trs:
        print(tr.TDs)
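To illustrate why the attempts in the question come back empty: `find_all(name="ITEM 1")` looks for a tag named "ITEM 1" rather than for text, and a pattern such as `^ITEM 1$` only matches a string that is exactly "ITEM 1" with an ordinary space. The sketch below assumes the special characters are non-breaking spaces (`\xa0`) and that the heading is followed by a title. The heading strings and the `tolerant` pattern are hypothetical and are not taken from the filing.

```python
import re

# Hypothetical heading strings, as they might come out of the parsed HTML.
headings = [
    "ITEM 1",                           # plain space, nothing after the number
    "ITEM\xa01. BUSINESS",              # non-breaking space plus a title
    "Item 1A. Risk Factors",            # a different item that must not match
    "ITEM 11. EXECUTIVE COMPENSATION",  # also must not match
]

# The pattern from the question: literal space, nothing allowed after the "1".
strict = re.compile(r"^ITEM 1$")

# A more tolerant pattern: any whitespace (including \xa0) before the number,
# and the number must be followed by "." or ":" or the end of the string.
tolerant = re.compile(r"^\s*ITEM\s+1\s*(?:[.:]|$)", re.IGNORECASE)

strict_hits = [h for h in headings if strict.search(h)]
tolerant_hits = [h for h in headings if tolerant.search(h)]

print(strict_hits)
print(tolerant_hits)
```

With BeautifulSoup, a compiled pattern like `tolerant` could be passed as the text filter in place of `re.compile('^ITEM 1$')`. Whether it matches still depends on how the heading is split across tags in a given filing.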
