Scraping row elements from a dynamic table with Bs4

Posted: 2018-11-18 17:08:39

Tags: python python-3.x web-scraping beautifulsoup

I am trying to get the NASDAQ-100 ticker symbols from the CNBC page https://www.cnbc.com/nasdaq-100/. I am new to Beautiful Soup, but if there is a simpler way to grab the list and save the data, I am interested in any solution. The code below does not raise an error; however, it also does not return any tickers.

import bs4 as bs
import pickle  # serializes the ticker list so we do not have to go back to the CNBC website each time we want the 100 ticker symbols
import requests

def save_nasdaq_tickers():
    '''We start by getting the source code for CNBC. We will use the requests module for this.'''
    resp = requests.get('https://www.cnbc.com/nasdaq-100')
    soup = bs.BeautifulSoup(resp.text, "lxml")  # resp.text is the text of the page source returned by requests
    table = soup.find('table', {'class': "data quoteTable"})  # the table whose class we think matches the quote table on CNBC
    tickers = []  # empty tickers list
    # Next we iterate through the table.
    for row in table.findAll('tr')[1:]:  # all table rows except the header row (row 0), so row 1 onward
        ticker = row.findAll('td')[0].txt  # td holds the table columns; column 0 is the one I take to be the tickers
        # We specify .txt because it is a soup object
        tickers.append(ticker)
    # Save this list of tickers using pickle and with open
    with open("Nasdaq100Tickers", "wb") as f:  # name the file Nasdaq100Tickers
        pickle.dump(tickers, f)  # dump the tickers to file f

    print(tickers)

    return tickers

save_nasdaq_tickers()

2 Answers:

Answer 0 (score: 2)

If you are wondering why there is nothing in your tickers, there is just one small error in your code: ticker = row.findAll('td')[0].txt should be ticker = row.findAll('td')[0].text. However, when you want to get the full content of a dynamic page, you need selenium.
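A minimal sketch of that selenium route, assuming Chrome with a matching chromedriver is available locally; the table class and column indexing are taken from the question's code:

from selenium import webdriver
import bs4 as bs
import time

driver = webdriver.Chrome()  # assumes chromedriver is on the PATH
driver.get('https://www.cnbc.com/nasdaq-100/')
time.sleep(3)  # crude wait so the dynamically loaded quote table has time to render
soup = bs.BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

table = soup.find('table', {'class': 'data quoteTable'})
tickers = [row.findAll('td')[0].text for row in table.findAll('tr')[1:]]
print(tickers)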

Answer 1 (score: 1)

You can mimic the XHR request the page issues and parse out the JSON that contains the data you are after.

import requests
import pandas as pd
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

# The quote endpoint the page calls via XHR; symbols are passed pipe-separated and the response is JSONP
url = 'https://quote.cnbc.com/quote-html-webservice/quote.htm?partnerId=2&requestMethod=quick&exthrs=1&noform=1&fund=1&output=jsonp&symbols=AAL|AAPL|ADBE|ADI|ADP|ADSK|ALGN|ALXN|AMAT|AMGN|AMZN|ATVI|ASML|AVGO|BIDU|BIIB|BMRN|CDNS|CELG|CERN|CHKP|CHTR|CTRP|CTAS|CSCO|CTXS|CMCSA|COST|CSX|CTSH&callback=quoteHandler1'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
s = soup.select('html')[0].text.strip('quoteHandler1(').strip(')')  # strip the quoteHandler1(...) JSONP wrapper to leave plain JSON
data = json.loads(s)
data = json_normalize(data)  # flatten the nested JSON into tabular form
df = pd.DataFrame(data)
print(df[['symbol', 'last']])
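If the end goal from the question is simply to save the symbols, the same response can be handled without BeautifulSoup and pickled directly. A sketch, assuming the quoteHandler1(...) wrapper and the 'symbol' column shown above:

import re
import json
import pickle
import requests
from pandas.io.json import json_normalize

url = 'https://quote.cnbc.com/quote-html-webservice/quote.htm?partnerId=2&requestMethod=quick&exthrs=1&noform=1&fund=1&output=jsonp&symbols=AAL|AAPL|ADBE|ADI|ADP|ADSK|ALGN|ALXN|AMAT|AMGN|AMZN|ATVI|ASML|AVGO|BIDU|BIIB|BMRN|CDNS|CELG|CERN|CHKP|CHTR|CTRP|CTAS|CSCO|CTXS|CMCSA|COST|CSX|CTSH&callback=quoteHandler1'
res = requests.get(url)
payload = re.search(r'quoteHandler1\((.*)\)', res.text, re.S).group(1)  # pull the JSON out of the JSONP wrapper
df = json_normalize(json.loads(payload))

tickers = df['symbol'].tolist()  # assumes the 'symbol' column seen in the DataFrame above
with open('Nasdaq100Tickers', 'wb') as f:
    pickle.dump(tickers, f)  # same pickle step the question asked about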

The quote service returns JSON like the following (sample expanded):