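I'm trying to retrieve a stock's "shares outstanding" from this page:
(click "Financial Statements" - "Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)")
The figure is at the bottom of the table, in the left-hand column. I'm using Beautiful Soup, but I can't get the share count out.
The code I'm using: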
import requests
from bs4 import BeautifulSoup
URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)
This prints nothing, but the desired output is 4,323,987,000.
Can someone help me retrieve this data?
Thanks!
Answer 0 (score: 1)
That page is rendered by JavaScript. Use Selenium:
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
# Selenium 4 expects the driver path via a Service object; older versions accepted it positionally
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.get(url)
sleep(10)  # wait 10 seconds so the page can finish rendering
# print(driver.page_source)  # uncomment to inspect the rendered source
soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)
driver.quit()  # close the browser when done
<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>
Better to use:
shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())
4,334,335
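Answer 1 (score: 1)
Ah, the joys of scraping EDGAR filings :(...
You can't get the expected output because you're looking in the wrong place. The URL you have is for the ixbrl viewer. The data comes from here: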
url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'
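You can find that URL by looking at the Network tab in your browser's developer tools, or you can translate the viewer URL into it directly: for example, the 320193& figure is the CIK number, and so on.
Once you've figured that out, the rest is simple: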
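The conversion described above can be sketched as a small helper. This is an illustration, not an official SEC mapping: the function name `viewer_to_report_url` and its `report` parameter are hypothetical, and the path layout (CIK without leading zeros, accession number with the dashes removed) is inferred from the two URLs quoted in this thread.

```python
from urllib.parse import urlparse, parse_qs


def viewer_to_report_url(viewer_url, report="R1"):
    """Build the Archives URL of one rendered report page (hypothetical helper)."""
    params = parse_qs(urlparse(viewer_url).query)
    # Assumption: the Archives path uses the CIK without leading zeros
    cik = str(int(params["cik"][0]))
    # Assumption: the folder name is the accession number with dashes removed
    accession = params["accession_number"][0].replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{cik}/{accession}/{report}.htm"


viewer = (
    "https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193"
    "&accession_number=0000320193-20-000052&xbrl_type=v#"
)
print(viewer_to_report_url(viewer))
# https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm
```

Other statements in the same filing presumably live under other `R` numbers, so `report` is left as a parameter; which number corresponds to which statement has to be checked in the Network tab.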
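import requests
from bs4 import BeautifulSoup as bs
# Assumption: sec.gov may reject requests without a descriptive User-Agent (placeholder below)
req = requests.get(url, headers={'User-Agent': 'Your Name your.email@example.com'})
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()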
Output:
'4,334,335'
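The value comes back as a string with thousands separators. If you need a number, a minimal sketch like the following may help. The helper name `parse_share_count` and its `scale` parameter are hypothetical; whether a given table reports shares in thousands is something to confirm from the table header, not something this code can know.

```python
def parse_share_count(text, scale=1):
    """Convert a string such as '4,334,335' to an int, multiplied by an optional scale."""
    return int(text.strip().replace(",", "")) * scale


print(parse_share_count("4,334,335"))              # 4334335
print(parse_share_count("4,334,335", scale=1000))  # 4334335000
```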
Edit:
To search by "shares outstanding", try:
targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())
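To try the label-based lookup without hitting sec.gov, here is a self-contained sketch that runs against a small inline table. The markup is hypothetical and reduced to the classes the selectors above rely on (`tr.ro`, `td.pl`, `td.nump`); the real page contains far more. The function name `find_value_by_label` is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled only on the class names used in this answer
SAMPLE_HTML = """
<table>
  <tr class="re"><td class="pl">Entity Registrant Name</td><td class="text">Example Corp</td></tr>
  <tr class="ro"><td class="pl">Entity Common Stock, Shares Outstanding</td><td class="nump">4,334,335</td></tr>
</table>
"""


def find_value_by_label(html, label):
    """Return the td.nump text of the first tr.ro row whose label cell contains `label`."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select("tr.ro"):
        label_cell = row.select_one("td.pl")
        value_cell = row.select_one("td.nump")
        if label_cell and value_cell and label in label_cell.get_text():
            return value_cell.get_text(strip=True)
    return None  # no matching row found


print(find_value_by_label(SAMPLE_HTML, "Shares Outstanding"))
```

Returning `None` when nothing matches makes a missing label easy to tell apart from an empty cell.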
Might as well throw this in too: a different approach is to use XPath with the lxml library instead:
import lxml.html as lh
doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]