We have 10-K filings for several companies and want to extract the income table (Item 6) from the HTML. The structure of the table differs from company to company.
For example:
url1 = 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm'
url2 = 'https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'
We need to get the table under Item 6, Selected Consolidated Financial Data.
One approach we tried is a string search for Item 6: take all the text from Item 6 up to Item 7 and then pull the tables out of that slice, like this:
import requests
import bs4 as bs

doc10K = requests.get(url2)
st6 = doc10K.text.lower().find("item 6")
end6 = doc10K.text.lower().find("item 7")

# Get the text from Item 6 to Item 7 and strip the currency sign
item6 = doc10K.text[st6:end6].replace('$', '')

Tsoup = bs.BeautifulSoup(item6, 'lxml')
# Extract all tables from the slice
html_tables = Tsoup.find_all('table')
This approach does not work for all the filings. With KSS, for example, we cannot find the string "item 6" at all. The desired output is the table given under Item 6.
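In the KSS filing the heading apparently has non-breaking spaces or entities between "Item" and "6", which a literal find("item 6") misses. As a minimal sketch (not part of the original attempt, and the pattern is only an assumption about how the heading may be encoded), a whitespace-tolerant regular expression can locate the section instead:

import re
import requests

doc10K = requests.get(url2)

# Allow plain spaces, non-breaking spaces (\s matches \xa0 in Python 3) or the
# &nbsp;/&#160; entities between "item" and the section number; assumed pattern.
item6_re = re.compile(r'item(?:\s|&nbsp;|&#160;)*6', re.IGNORECASE)
item7_re = re.compile(r'item(?:\s|&nbsp;|&#160;)*7', re.IGNORECASE)

m6 = item6_re.search(doc10K.text)
m7 = item7_re.search(doc10K.text)
if m6 and m7:
    item6 = doc10K.text[m6.start():m7.start()].replace('$', '')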
Answer 0 (score: 1)
petezurich is right, but the marker is not located precisely enough.
from simplified_scrapy.simplified_doc import SimplifiedDoc
import requests

doc10K = requests.get(url2)
doc = SimplifiedDoc(doc10K.text)

# You can try this, too: match either of the two heading strings below
start = doc.html.rfind('Selected Consolidated Financial Data')
if start < 0:
    start = doc.html.rfind('Selected Financial Data')

tables = doc.getElementsByTag('table', start=start, end=['Item 7'])
for table in tables:
    trs = table.trs
    for tr in trs:
        tds = tr.tds
        for td in tds:
            print(td.text)
            # print(td.unescape())  # Replace HTML entities
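As a possible follow-up (not part of this answer), the same table.trs / tr.tds attributes used above can feed the cell texts into a pandas DataFrame instead of printing them:

import pandas as pd

rows = []
for table in tables:
    for tr in table.trs:
        # one list of cell texts per table row
        rows.append([td.text for td in tr.tds])

df = pd.DataFrame(rows)
print(df.head())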
Answer 1 (score: 0)
The string "item 6" seems to contain whitespace or non-breaking spaces. Try the following cleaned-up code:
import requests
from bs4 import BeautifulSoup

url1 = 'https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm'
url2 = 'https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm'

doc10K = requests.get(url2)

st6 = doc10K.text.lower().find("item 6")
# Found "item 6"? If not, search again with an underscore
if st6 == -1:
    st6 = doc10K.text.lower().find("item_6")

end6 = doc10K.text.lower().find("item 7")

item6 = doc10K.text[st6:end6].replace('$', '')

soup = BeautifulSoup(item6, 'lxml')
html_tables = soup.find_all('table')
Answer 2 (score: 0)
With bs4 4.7.1+ you can use :contains and :has to specify an appropriate match pattern for the tables based on the HTML. Using the CSS "or" syntax (comma-separated selectors), a single selector can match both of the layouts shown below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

urls = ['https://www.sec.gov/Archives/edgar/data/794367/000079436719000038/m-0202201910xk.htm',
        'https://www.sec.gov/Archives/edgar/data/885639/000156459019009005/kss-10k_20190202.htm']

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = bs(r.content, 'lxml')
        table = pd.read_html(str(soup.select_one('table:contains("Item 6") ~ div:has(table) table, p:contains("Selected Consolidated Financial Data") ~ div:has(table) table')))[0]
        table.dropna(axis=0, how='all', inplace=True)
        table.dropna(axis=1, how='all', inplace=True)
        table.fillna(' ', inplace=True)
        table.rename(columns=table.iloc[0], inplace=True)  # set headers from the first remaining row
        table.drop(table.index[0:2], inplace=True)         # drop the original header rows
        table.reset_index(drop=True, inplace=True)         # re-index
        print(table)
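The comma in the long selector is the CSS "or": select_one returns the first element matching either alternative, one per filing layout. Splitting it into two named pieces (the variable names here are only illustrative) shows the intent:

# Alternative 1: a table inside a div that follows a <table> containing "Item 6"
sel_a = 'table:contains("Item 6") ~ div:has(table) table'
# Alternative 2: a table inside a div that follows a <p> containing "Selected Consolidated Financial Data"
sel_b = 'p:contains("Selected Consolidated Financial Data") ~ div:has(table) table'

# Comma-separated selectors act as an "or"; select_one returns the first match
table_tag = soup.select_one(sel_a + ', ' + sel_b)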