How do I scrape SIC code descriptions?

Time: 2020-06-21 21:21:15

Tags: python web-scraping beautifulsoup

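Hi, I'm using BS4 to scrape SIC codes and their descriptions. The code I have so far gets the company name and the SIC code exactly as I need, but I can't work out how to scrape the description text that follows the code, which I can see both in the inspect-element view and in the page source.

To be clear, the descriptions I need are "STATE COMMERCIAL BANKS" and "LABORATORY ANALYTICAL INSTRUMENTS".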

https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search

<div class="companyInfo">
     <span class="companyName">COMMERCIAL NATIONAL FINANCIAL CORP /PA <acronym title="Central Index Key">CIK</acronym>#: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0000866054&amp;owner=exclude&amp;count=40">0000866054 (see all company filings)</a></span>
     <p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=6022&amp;owner=exclude&amp;count=40">6022</a> - STATE COMMERCIAL BANKS<br />State location: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;State=PA&amp;owner=exclude&amp;count=40">PA</a> | State of Inc.: <strong>PA</strong> | Fiscal Year End: 1231<br />(Office of Finance)<br />Get <a href="/cgi-bin/own-disp?action=getissuer&amp;CIK=0000866054"><b>insider transactions</b></a> for this <b>issuer</b>.

import requests
from bs4 import BeautifulSoup

for cik_num in cik_num_list:
    try:
        url = r"https://www.sec.gov/cgi-bin/browse-edgar?CIK={}&owner=exclude&action=getcompany".format(cik_num)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        try:
            comp_name = soup.find_all('div', {'class':'companyInfo'})[0].find('span').text
            sic_code = soup.find_all('p', {'class':'identInfo'})[0].find('a').text
        # (the snippet in the question ends here; the except clauses were not shown)


1 Answer:

Answer 0: (score: 1)

import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=866054&owner=exclude&action=getcompany&Find=Search'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

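# The description is the text node right after the first <a> in .identInfo
# (" - STATE COMMERCIAL BANKS"); split(maxsplit=1)[-1] drops the leading dash.
sic_code_desc = soup.select_one('.identInfo').a.find_next_sibling(text=True).split(maxsplit=1)[-1]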
print(sic_code_desc)

Prints:

STATE COMMERCIAL BANKS

For url = 'https://www.sec.gov/cgi-bin/browse-edgar?CIK=1090872&owner=exclude&action=getcompany&Find=Search', it prints:

LABORATORY ANALYTICAL INSTRUMENTS
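For trying the same idea without a live request, here is a minimal sketch that parses a trimmed copy of the markup quoted in the question. The function name `extract_sic` and the `SAMPLE_HTML` constant are made up for illustration, and the sketch assumes the SIC link can be recognised by `SIC=` in its `href`, as it is in that markup. It also returns `None` values instead of raising when the page has no SIC entry.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the markup quoted in the question (used instead of a live request).
SAMPLE_HTML = """
<div class="companyInfo">
  <span class="companyName">COMMERCIAL NATIONAL FINANCIAL CORP /PA</span>
  <p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=6022&amp;owner=exclude&amp;count=40">6022</a> - STATE COMMERCIAL BANKS<br />State location: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;State=PA&amp;owner=exclude&amp;count=40">PA</a></p>
</div>
"""


def extract_sic(html):
    """Return (sic_code, sic_description); both are None if no SIC link is found."""
    soup = BeautifulSoup(html, "html.parser")
    ident = soup.select_one("p.identInfo")
    if ident is None:
        return None, None
    # Assumption: the SIC link is the one whose href carries a SIC= parameter.
    link = ident.find("a", href=re.compile(r"SIC="))
    if link is None:
        return None, None
    # The description is the bare text node right after the link,
    # e.g. " - STATE COMMERCIAL BANKS".
    tail = link.next_sibling
    description = None
    if isinstance(tail, NavigableString):
        description = tail.strip().lstrip("-").strip() or None
    return link.get_text(strip=True), description


sic_code, sic_desc = extract_sic(SAMPLE_HTML)
print(sic_code, sic_desc)
```

With a live page, the same function could be fed `requests.get(url).content` inside the loop over `cik_num_list`. The SEC site may reject requests that lack a descriptive User-Agent header, so passing a `headers=` argument to `requests.get` may be needed.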