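So here's the problem: I'm trying to use BeautifulSoup to pull some data from the SEC database. I'm new to Python, but I was able to write the following code.
The idea is to take a list of quote (ticker) symbols from a .txt file and extract the "CIK" number of each company for further use.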
import requests
from bs4 import BeautifulSoup

list_path = r"C:\Users\User1\Downloads\Quote list.txt"

with open(list_path, "r") as flist:
    for quote in flist:
        quote = quote.replace("\n", "")
        url = (r"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + quote +
               r"&type=10&dateb=&owner=exclude&count=100")
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for company_info in soup.find_all("span", {"class": "companyName"}):
            cik_code = company_info.string
            print(cik_code)
So far, the code above prints None as the value of cik_code. The element in the HTML looks like this:
<span class="companyName dm-selected dm-test">
AAON INC
<acronym title="Central Index Key">CIK</acronym>
#:
<a href="/cgi-bin/browse-edgar?
action=getcompany&CIK=0000824142&owner=exclude&count=100"
class="">0000824142 (see all company filings)</a>
</span>
The CIK code is the number at the end, 0000824142, just before "(see all company filings)".
How can I set that number as the string cik_code?
Answer 0 (score: 0)
I think you just need to get at the <a> tag inside the <span> tag.
for company_info in soup.find_all('span', {'class': 'companyName'}):
    cik_code = company_info.find_next('a').text.split(' ', maxsplit=1)[0]
    print(cik_code)
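As for why the original code printed None: in BeautifulSoup, a tag's .string is only set when the tag has exactly one child, and this <span> has several (text, an <acronym>, more text, and the <a>). A minimal offline sketch of that behaviour, assuming the markup looks like the snippet in the question (the href is joined onto one line here):

```python
from bs4 import BeautifulSoup

# Sample markup based on the snippet in the question.
html = """
<span class="companyName dm-selected dm-test">
AAON INC
<acronym title="Central Index Key">CIK</acronym>
#:
<a href="/cgi-bin/browse-edgar?action=getcompany&CIK=0000824142&owner=exclude&count=100" class="">0000824142 (see all company filings)</a>
</span>
"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", {"class": "companyName"})

# The span has several children, so .string cannot pick one and is None.
span_string = span.string

# The <a> tag has a single text child, so .string works there.
link_string = span.find("a").string

print(span_string)
print(link_string)
```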
Explanation
company_info.find_next('a')
returns: <a href="/cgi-bin/browse-edgar? action=getcompany&CIK=0000824142&owner=exclude&count=100" class="">0000824142 (see all company filings)</a>
.text
returns: 0000824142 (see all company filings)
.split(' ', maxsplit=1)[0]
returns: 0000824142
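One thing to be aware of: find_next('a') returns the next <a> anywhere later in the document, so if a span ever had no link of its own it would pick up an unrelated one. A slightly more defensive sketch, assuming the page structure shown in the question, searches only inside each span with find('a') and takes the leading digits of the link text. The helper name extract_ciks is made up for this example and is not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup


def extract_ciks(html):
    """Return the CIK found in each companyName span (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    ciks = []
    for span in soup.find_all("span", {"class": "companyName"}):
        link = span.find("a")  # only looks at descendants of this span
        if link is None:
            continue  # no link inside the span, nothing to extract
        match = re.match(r"\s*(\d+)", link.get_text())
        if match:
            ciks.append(match.group(1))
    return ciks


# Sample markup modelled on the question's snippet.
sample = (
    '<span class="companyName">AAON INC '
    '<acronym title="Central Index Key">CIK</acronym>#: '
    '<a href="/cgi-bin/browse-edgar?action=getcompany">'
    "0000824142 (see all company filings)</a></span>"
)

print(extract_ciks(sample))  # ['0000824142']
```

In the loop from the question, this would be called as extract_ciks(plain_text) for each quote.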