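Scraping data with BeautifulSoup returns an empty DataFrame

Time: 2020-03-12 10:00:34

Tags: python beautifulsoup

I am trying to use BeautifulSoup to scrape the NASDAQ-100 data from CNBC's website, but when I convert the scraped data to a DataFrame, it prints Empty DataFrame, Columns: [], Index: []

Here is my code: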

# Importing Libraries
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

# Create parse tree for parsed pages 
page=requests.get("https://www.cnbc.com/nasdaq-100")

#content=page.content

# Scrape data from specific <div> column
# Title for the data table -> NASDAQ-100

soup=BeautifulSoup(page.content,"html.parser")
l = []
title=soup.find("div",{"class":"PageHeader-main"}).find("h1").text

table=soup.find_all("table",{"class":"BasicTable-basicTable"})

for items in table:
    for i in range(len(items.find_all("tr"))-1):
        # Gather data
        d = {}
        d["stock_symbol"] = items.find_all("td", {"class":"BasicTable-symbol"})[i].find("a").text
        d["stock_name"] = items.find_all("td", {"class":"BasicTable-name"})[i].text
        d["price"] = items.find_all("td", {"class":"BasicTable-unchanged BasicTable-numData"})[i].text
        d["price_change"] = items.find_all("td", {"class":"BasicTable-quoteDecline"})[i].text
        d["percentage_change"] = items.find_all("td", {"class":"BasicTable-quoteDecline"})[i].text
        # Print ("")
        l.append(d)         
df = pd.DataFrame(l)
print(df)

1 answer:

Answer 0 (score: 2)

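The website you are working with renders its data with JavaScript after the page loads, so the HTML that requests downloads does not contain the table yet. That leaves two options:

  1. Trace the XHR request to the API endpoint that serves the data and fetch it directly.
  2. Use selenium to render the page in a real browser.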
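
Before picking an option, it can help to confirm that the table really is missing from the downloaded HTML. The following is a minimal, dependency-free sketch; the two HTML strings are hypothetical stand-ins, not the actual CNBC markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each tag opens in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of <tag> start tags found in the html string."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag, 0)


# Hypothetical stand-ins for what requests returns versus what the browser shows.
static_html = "<html><body><div id='root'></div><script src='app.js'></script></body></html>"
rendered_html = "<html><body><table><tr><td>AAPL</td></tr><tr><td>MSFT</td></tr></table></body></html>"

print(count_tag(static_html, "tr"))    # 0
print(count_tag(rendered_html, "tr"))  # 2
```

Running count_tag(page.text, "tr") on the response from the question should show whether any rows arrived at all; with BeautifulSoup the equivalent check is len(soup.find_all("tr")).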

Both solutions are listed below, the API request first and the selenium version second:

import requests
import json

r = requests.get("https://quote.cnbc.com/quote-html-webservice/quote.htm?noform=1&partnerId=2&fund=1&exthrs=0&output=json&symbolType=issue&symbols=153171|172296|74548134|178129|90065764|185811|181702|3145559|8279577|8392868|196573|197784|177124|144094|205778|207106|208206|208526|217706|211573|217809|218647|25427545|223056|225584|226052|226354|90065765|227524|237331|240690|244210|253970|263397|248911|264170|256951|273612|24812378|274516|7186257|9079610|4038959|282500|21167615|282560|283581|284350|50675033|288727|288976|289807&requestMethod=extended").json()

# Pretty-print the parsed JSON to inspect its structure
data = json.dumps(r, indent=4)

print(data)

print(r.keys())
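
The API response still has to be turned into rows before it can become a DataFrame. Since the key names of the CNBC payload are not shown here, the sketch below avoids guessing them: it walks the parsed JSON and returns the first list of objects it finds. The sample payload and its field names are hypothetical.

```python
def find_records(node):
    """Return the first list of dicts found in a nested JSON value, depth-first."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            return node
        children = node
    elif isinstance(node, dict):
        children = node.values()
    else:
        return None
    for child in children:
        found = find_records(child)
        if found is not None:
            return found
    return None


# Hypothetical payload: the real key names in the CNBC response are not shown above.
sample = {
    "result": {
        "count": 2,
        "quotes": [
            {"symbol": "AAPL", "last": "275.43"},
            {"symbol": "MSFT", "last": "153.63"},
        ],
    }
}

records = find_records(sample)
print([row["symbol"] for row in records])  # ['AAPL', 'MSFT']
```

Assuming the real response nests its quotes as a list of objects, pd.DataFrame(find_records(r)) should then give one row per quote. If the objects contain nested dictionaries, pd.json_normalize can flatten them into columns.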
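# Option 2: let a headless Firefox render the page, then parse the HTML
from io import StringIO
from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)

driver.get("https://www.cnbc.com/nasdaq-100")

# Give the page's JavaScript time to fill in the table
sleep(2)

# read_html returns a list of DataFrames, one per <table>; take the first.
# Recent pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(driver.page_source))[0]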

print(df)
df.to_csv("result.csv", index=False)

driver.quit()
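
A fixed two-second pause can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the rows actually appear. The helper below is a generic sketch of that idea, not part of the original answer; wait_until is a hypothetical name, and the "table tr" selector in the usage comment is a guess about the page markup.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate repeatedly until it returns a truthy value or timeout seconds pass.

    Returns the truthy value, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Assumed usage with the driver from the answer (the "table tr" selector is a guess):
#   from selenium.webdriver.common.by import By
#   wait_until(lambda: driver.find_elements(By.CSS_SELECTOR, "table tr"))
#   df = pd.read_html(driver.page_source)[0]
```

Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose, which is likely the better choice in a larger script.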

Output: check-online
