Unable to retrieve data from Macrotrends with Selenium and read_html to create a dataframe?

Asked: 2019-06-02 17:42:28

Tags: python pandas selenium web-scraping beautifulsoup

I want to import data from Macrotrends into a pandas dataframe. Looking at the site's page source, the data appears to live inside a jqxgrid.

I have tried pandas' read_html function and Beautiful Soup, but no table is found. I am currently trying to extract the data with Selenium instead, hoping that if I drag the jqxgrid's horizontal scrollbar I can pull out all of the columns. However, that did not work.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time

driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script("window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)


grid = driver.find_element_by_id('jqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element_by_id('jqxScrollThumbhorizontalScrollBarjqxgrid')

time.sleep(1)

actions = ActionChains(driver)
time.sleep(1)

for i in range(1,6):
    actions.drag_and_drop_by_offset(scrollbar,i*70,0).perform()
    time.sleep(1)

pd.read_html(grid.get_attribute('outerHTML'))

The error I get is:

ValueError: No tables found

I would like to import the table data from "http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q" into a dataframe.

2 answers:

Answer 0 (score: 1):

The problem is that the data is not in a table but in 'div' elements. I'm no pandas expert, but you can do this with BeautifulSoup.

Insert this line after your other imports:

from bs4 import BeautifulSoup

Then change your last line to:

soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

This finds all 'div' elements with the attribute role="row". It then reads the text of each 'div' that is a direct child of one of those row divs; because some cells contain further nested 'div' elements, it descends only one level (recursive=False).
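The effect of `recursive=False` can be seen on a small standalone snippet. The HTML below is invented for illustration and only mimics the nesting pattern of a jqxgrid row; it is not taken from the site:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for one grid row: each cell div wraps an inner div.
html = """
<div role="row">
  <div class="cell"><div class="inner">Revenue</div></div>
  <div class="cell"><div class="inner">$4,135</div></div>
</div>
"""
row = BeautifulSoup(html, "html.parser").find("div", {"role": "row"})

# recursive=False returns only the two direct-child cell divs...
direct = row.findChildren("div", recursive=False)
print(len(direct))    # 2

# ...whereas the default descends into the nested inner divs as well.
all_divs = row.findChildren("div")
print(len(all_divs))  # 4
```

Without `recursive=False`, each cell's text would be collected twice (once from the outer cell div and once from the inner div), duplicating every value in the resulting rows.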

Output:

                                     0  1       2       3       4       5   \
0                               Revenue     $4,135  $5,672  $3,262  $2,886   
1                    Cost Of Goods Sold     $3,179  $4,501  $2,500  $2,185   
2                          Gross Profit       $956  $1,171    $762    $701   
3     Research And Development Expenses       $234    $222    $209    $201   
4                         SG&A Expenses       $518    $675    $427    $381   
5    Other Operating Income Or Expenses        $-6     $-3     $-3     $-3   
...
        6       7       8       9       10      11      12      13      14  
0   $3,015  $3,986  $2,307  $2,139  $2,279  $2,977  $1,858  $1,753  $1,902  
1   $2,296  $3,135  $1,758  $1,630  $1,732  $2,309  $1,395  $1,303  $1,444  
2     $719    $851    $549    $509    $547    $668    $463    $450    $458  
3     $186    $177    $172    $167    $146    $132    $121    $106     $92  
4     $388    $476    $335    $292    $292    $367    $247    $238    $257  
5        -     $-2     $-2     $-3     $-3     $-4    $-40     $-2     $-1  
...

However, as you scroll across the page, the items on the left are removed from the page source, so not all of the data gets scraped in one pass.
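One way around that virtualization (a sketch, not tested against the site) is to snapshot the grid after each scroll step and merge the partial dataframes on the row-label column, so columns captured at different scroll positions line up. The frames below are invented stand-ins for two such snapshots:

```python
import pandas as pd

# Hypothetical partial frames, one per scroll position: each keeps the
# row label plus whichever date columns happened to be visible.
chunk1 = pd.DataFrame({"label": ["Revenue", "Gross Profit"],
                       "2008-03-31": ["$4,135", "$956"]})
chunk2 = pd.DataFrame({"label": ["Revenue", "Gross Profit"],
                       "2007-12-31": ["$5,672", "$1,171"]})

# Outer-merge on the label column so rows align across snapshots and
# no column captured in only one snapshot is dropped.
merged = chunk1.merge(chunk2, on="label", how="outer")
print(merged)
```

In the real loop you would rebuild a frame from `grid.get_attribute('outerHTML')` after each `drag_and_drop_by_offset` call and fold it into `merged` the same way.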

Updated in response to the comments. To set the column headers, use:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script(
    "window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)

grid = driver.find_element_by_id('wrapperjqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element_by_id('jqxScrollThumbhorizontalScrollBarjqxgrid')

time.sleep(1)

actions = ActionChains(driver)
time.sleep(1)

for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i * 70, 0).perform()
    time.sleep(1)


soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
headersList = soup.findAll('div', {'role': 'columnheader'})
col_names=[h.text for h in headersList]
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data, columns=col_names)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

Output:

   Quarterly Data | Millions of US $ except per share data   2008-03-31  \
0                                             Revenue            $4,135
1                                  Cost Of Goods Sold            $3,179
2                                        Gross Profit              $956
...
   2007-12-31 2007-09-30 2007-06-30 2007-03-31 2006-12-31 2006-09-30  \
0      $5,672     $3,262     $2,886     $3,015     $3,986     $2,307
1      $4,501     $2,500     $2,185     $2,296     $3,135     $1,758
...
   2006-06-30 2006-03-31 2005-12-31 2005-09-30 2005-06-30 2005-03-31  
0      $2,139     $2,279     $2,977     $1,858     $1,753     $1,902
1      $1,630     $1,732     $2,309     $1,395     $1,303     $1,444

Answer 1 (score: 0):

Here is a much faster alternative to Selenium; it reads the data straight out of the page source and picks up the column headers as well, as shown below.

import requests
from bs4 import BeautifulSoup as bs
import re
import json
import pandas as pd

r = requests.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
p = re.compile(r' var originalData = (.*?);\r\n\r\n\r',re.DOTALL)
data = json.loads(p.findall(r.text)[0])
headers = list(data[0].keys())
headers.remove('popup_icon')
result = []

for row in data:
    soup = bs(row['field_name'], 'html.parser')  # explicit parser avoids a GuessedAtParserWarning
    field_name = soup.select_one('a, span').text
    fields = list(row.values())[2:]
    fields.insert(0, field_name)
    result.append(fields)

df = pd.DataFrame(result, columns=headers)
# option_context is a context manager; it has no effect unless used with "with"
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.head())
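The core trick here, pulling a JavaScript variable out of the raw HTML with a regex and decoding it as JSON, can be checked on a minimal stand-in page. The HTML below is invented for illustration and only mimics the `var originalData = ...;` assignment this answer targets:

```python
import re
import json

# Invented page source embedding a JS array, mimicking the real page's
# "var originalData = [...];" assignment inside a <script> tag.
page = """
<script>
 var originalData = [{"field_name": "Revenue", "2008-03-31": "$4,135"}];
</script>
"""

# DOTALL lets '.' span newlines in case the array is pretty-printed;
# the non-greedy (.*?) stops at the first semicolon after the array.
p = re.compile(r'var originalData = (.*?);', re.DOTALL)
data = json.loads(p.search(page).group(1))
print(data[0]["field_name"])  # Revenue
```

This works because the site ships the grid's contents as a JSON-like literal in the page source, so no browser is needed; the regex in the answer above additionally anchors on the trailing `\r\n` whitespace of the real page.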
