I'm working on a school project: given a stock ticker symbol, find the number of people "following" it on SeekingAlpha. But when I try to print that value I keep getting None. How do I fix this?
This is my first attempt at web scraping, but I did some research on BeautifulSoup and decided it was the best tool to use. I'm also working in an Anaconda environment. In my code I try to find the full company name for the ticker, as well as the number of people following that stock on SeekingAlpha. For some reason I can retrieve the company name for the ticker, but when I try to print the number of followers it shows None. I've tried every variation I can think of to find the followers, but they all result in None.
Here is the HTML: (Here I want the value 83,530)
Here is my code:
import requests
import urllib.request as urllib2
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from lxml import etree
listOfTickers = ["ATVI", "GOOG", "AAPL", "AMZN", "BRK.B", "BRK.A", "NFLX", "SNAP"]
for i in range(len(listOfTickers)):
    ticker = listOfTickers[i]
    quotePage = Request("https://seekingalpha.com/symbol/" + ticker, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(quotePage).read()
    soup = BeautifulSoup(page, "lxml")

    company_name = soup.find("div", {"class": "ticker-title"})
    followers_number = soup.find('div', {"class": "followers-number"})

    company = company_name.text.strip()
    #followers = followers_number.text.strip()

    print(followers_number)
    print(company)
Here are my results.
Answer 0 (score: 0)
Try the following approach to get the output you're after. The content you want is generated dynamically, so the requests module or urllib alone won't help. You can use any browser simulator, or use requests-html to solve the problem. BeautifulSoup isn't strictly necessary either; I've kept it only because you were already using it.
from requests_html import HTMLSession
from bs4 import BeautifulSoup

tickers = ["ATVI", "GOOG", "AAPL", "AMZN"]

with HTMLSession() as session:
    for i in range(len(tickers)):
        quotePage = session.get("https://seekingalpha.com/symbol/{}".format(tickers[i]))
        # sleep=5 gives the JavaScript time to run before the HTML is captured
        quotePage.html.render(sleep=5)
        soup = BeautifulSoup(quotePage.html.html, "lxml")
        followers_number = soup.find(class_="followers-number")
        print(followers_number)
The output you may get:
<div class="followers-number">(<span>83,532</span> followers)</div>
<div class="followers-number" title="1,032,510">(<span>1.03M</span> followers)</div>
<div class="followers-number" title="2,065,199">(<span>2.07M</span> followers)</div>
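Note that the abbreviated counts inside the span (e.g. 2.07M) are rounded, while the exact figure sits in the title attribute when present. A minimal sketch of pulling the exact number out of such a div (the HTML string below is copied from the sample output above):

```python
from bs4 import BeautifulSoup

# Sample div copied from the output above
html = '<div class="followers-number" title="1,032,510">(<span>1.03M</span> followers)</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find(class_="followers-number")

# Prefer the exact count in the title attribute; fall back to the span text
count_text = div.get("title") or div.find("span").text
count = int(count_text.replace(",", ""))
print(count)  # 1032510
```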
Answer 1 (score: 0)
Just use the same endpoint the page itself calls to retrieve the subscription info, including the number of followers:
import requests

tickers = ["atvi", "goog", "aapl", "amzn", "brk.b", "brk.a", "nflx", "snap"]

with requests.Session() as s:
    for ticker in tickers:
        r = s.get('https://seekingalpha.com/memcached2/get_subscribe_data/{}?id={}'.format(ticker, ticker)).json()
        print(ticker, r['portfolio_count'])
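One detail worth noting: the ticker list above is written in lowercase, so symbols like BRK.B should be normalized before the URL is built. A small hypothetical helper sketching just that URL construction:

```python
def subscribe_url(ticker):
    # Hypothetical helper: build the endpoint URL used above,
    # lowercasing the symbol to match how the list above is written
    t = ticker.lower()
    return 'https://seekingalpha.com/memcached2/get_subscribe_data/{0}?id={0}'.format(t)

print(subscribe_url("BRK.B"))
# https://seekingalpha.com/memcached2/get_subscribe_data/brk.b?id=brk.b
```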
Answer 2 (score: 0)
Since the follower count is loaded via Ajax, BeautifulSoup cannot access its value. With a headless browser such as Selenium/PhantomJS you can get the full HTML, including JavaScript-generated content. Another approach is to make an extra request to the endpoint the page uses to render that part with JavaScript. Here is a working solution:
import requests
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

listOfTickers = ["ATVI", "GOOG", "AAPL", "AMZN", "BRK.B", "BRK.A", "NFLX", "SNAP"]

def getFollowersCount(ticker):
    # Build the URL for the endpoint the page itself calls
    url = 'https://seekingalpha.com/memcached2/get_subscribe_data/{}?id={}'.format(ticker.lower(), ticker.lower())
    # Using the requests module, not urllib.request
    counter = requests.get(url)
    # If the response is JSON, return portfolio_count; otherwise 0
    try:
        return counter.json()['portfolio_count']
    except (ValueError, KeyError):
        return 0

for ticker in listOfTickers:
    quotePage = Request("https://seekingalpha.com/symbol/" + ticker, headers={'User-Agent': 'Mozilla/5.0'})
    page = urlopen(quotePage).read()
    soup = BeautifulSoup(page, "lxml")

    company_name = soup.find("div", {"class": "ticker-title"})
    #followers_number = soup.find('div', {"class": "followers-number"})
    followers_number = getFollowersCount(ticker)

    company = company_name.text.strip()
    print(followers_number)
    print(company)
Answer 3 (score: 0)
The best approach is to monitor the network traffic. Press Shift+Ctrl+I (on Windows) and watch how the page sends and receives data. :) You will see that the data comes from "https://seekingalpha.com/memcached2/get_subscribe_data", so this will do the job for you:
from collections import defaultdict
from requests import Session

tickers = ['atvi', 'goog', 'aapl', 'amzn']
storage = defaultdict(str)  # storing data
URL = 'https://seekingalpha.com/memcached2/get_subscribe_data'

# Start a session. Here you can add headers and/or cookies
curl = Session()

for tick in tickers:
    param = {'id': tick}
    response = curl.get(f'{URL}/{tick}', params=param).json()
    storage[tick] = response['portfolio_count']

# show the results
print(storage)