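I want to import data from Macrotrends into a pandas DataFrame. Looking at the page source, the data appears to live in a jqxGrid.
I have already tried pandas / Beautiful Soup with the read_html function, but no table was found. I am now trying to use Selenium to extract the data. I was hoping that if I moved the horizontal scrollbar, the jqxgrid table would load and I could extract it. However, that did not work.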
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time
driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script("window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)
grid = driver.find_element(By.ID, 'jqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element(By.ID, 'jqxScrollThumbhorizontalScrollBarjqxgrid')
time.sleep(1)
actions = ActionChains(driver)
time.sleep(1)
for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i * 70, 0).perform()
    time.sleep(1)
pd.read_html(grid.get_attribute('outerHTML'))
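The error I get is:
ValueError: No tables found
I would like the tabular data from "http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q" imported into a DataFrame.
Answer 0 (score: 1)
The problem is that the data is not in a table but in 'div' elements. I'm not a pandas expert, but you can do this with BeautifulSoup.
Insert this line after your other imports: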
from bs4 import BeautifulSoup
Then change your last line to:
soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
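This finds all 'div' elements that have the attribute role="row". It then reads the text of each 'div' directly beneath those row elements; because some cells contain several nested 'div' elements, the search descends only one level (recursive=False).
Output: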
0 1 2 3 4 5 \
0 Revenue $4,135 $5,672 $3,262 $2,886
1 Cost Of Goods Sold $3,179 $4,501 $2,500 $2,185
2 Gross Profit $956 $1,171 $762 $701
3 Research And Development Expenses $234 $222 $209 $201
4 SG&A Expenses $518 $675 $427 $381
5 Other Operating Income Or Expenses $-6 $-3 $-3 $-3
...
6 7 8 9 10 11 12 13 14
0 $3,015 $3,986 $2,307 $2,139 $2,279 $2,977 $1,858 $1,753 $1,902
1 $2,296 $3,135 $1,758 $1,630 $1,732 $2,309 $1,395 $1,303 $1,444
2 $719 $851 $549 $509 $547 $668 $463 $450 $458
3 $186 $177 $172 $167 $146 $132 $121 $106 $92
4 $388 $476 $335 $292 $292 $367 $247 $238 $257
5 - $-2 $-2 $-3 $-3 $-4 $-40 $-2 $-1
...
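The row and cell selection described above can be tried without a browser. The following is a minimal sketch that uses the standard library's html.parser in place of BeautifulSoup, so it runs with no extra dependencies. SAMPLE_HTML, RowCellParser and parse_rows are hypothetical names, and the markup is a simplified stand-in for the real grid, which is more deeply nested.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the grid markup (assumption, not the real page).
SAMPLE_HTML = """
<div id="jqxgrid">
  <div role="row">
    <div><div>Revenue</div></div>
    <div>$4,135</div>
    <div>$5,672</div>
  </div>
  <div role="row">
    <div><div>Gross Profit</div></div>
    <div>$956</div>
    <div>$1,171</div>
  </div>
</div>
"""


class RowCellParser(HTMLParser):
    """Collect the text of each direct child <div> of every <div role="row">."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self.depth = 0         # current <div> nesting depth
        self.row_depth = None  # depth of the currently open role="row" div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        self.depth += 1
        if self.row_depth is None and dict(attrs).get("role") == "row":
            self.row_depth = self.depth
            self.rows.append([])
        elif self.row_depth is not None and self.depth == self.row_depth + 1:
            # A direct child of the row starts a new cell.
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.depth == self.row_depth:
            self.row_depth = None  # the row div itself is closing
        self.depth -= 1

    def handle_data(self, data):
        # Text nested at any depth inside a cell is appended to that cell.
        if self.row_depth is not None and self.depth > self.row_depth:
            self.rows[-1][-1] += data.strip()


def parse_rows(html):
    parser = RowCellParser()
    parser.feed(html)
    return parser.rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Each row comes back as one list of cell strings, with the text of nested 'div' elements folded into its top-level cell, which is the same shape that the list comprehension in the answer passes to pd.DataFrame.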
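Note, however, that as you scroll across the page, the items on the left are removed from the page source, so not all of the data is scraped.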
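One possible workaround, which is not part of the original answer, would be to capture the visible cells at each scroll position and merge the captures afterwards. The sketch below shows only the merging step, under the assumption that each capture has already been turned into a mapping from row label to {column header: value}. The function name merge_snapshots and the sample snapshots are hypothetical.

```python
def merge_snapshots(snapshots):
    """Merge partial captures of the grid into one table.

    Each snapshot maps a row label to a {column header: value} dict that
    holds only the columns visible at one scroll position.
    """
    merged = {}
    for snapshot in snapshots:
        for label, cells in snapshot.items():
            # Columns seen in more than one snapshot are simply overwritten.
            merged.setdefault(label, {}).update(cells)
    return merged


# Hypothetical captures taken before and after dragging the scrollbar.
first = {"Revenue": {"2008-03-31": "$4,135", "2007-12-31": "$5,672"}}
second = {"Revenue": {"2007-12-31": "$5,672", "2007-09-30": "$3,262"}}

table = merge_snapshots([first, second])
print(table)
```

A nested dict of this shape can be handed to pandas.DataFrame.from_dict with orient="index" to get one row per label.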
Updated in response to a comment. To set the column headers, use:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.maximize_window()
driver.execute_script(
    "window.location = 'http://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q';")
driver.implicitly_wait(2)
grid = driver.find_element(By.ID, 'wrapperjqxgrid')
time.sleep(1)
driver.execute_script("window.scrollBy(0, 600);")
scrollbar = driver.find_element(By.ID, 'jqxScrollThumbhorizontalScrollBarjqxgrid')
time.sleep(1)
actions = ActionChains(driver)
time.sleep(1)
for i in range(1, 6):
    actions.drag_and_drop_by_offset(scrollbar, i * 70, 0).perform()
    time.sleep(1)
soup = BeautifulSoup(grid.get_attribute('outerHTML'), "html.parser")
headersList = soup.findAll('div', {'role': 'columnheader'})
col_names=[h.text for h in headersList]
divList = soup.findAll('div', {'role': 'row'})
data = [[x.text for x in div.findChildren('div', recursive=False)] for div in divList]
df = pd.DataFrame(data, columns=col_names)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)
Output:
Quarterly Data | Millions of US $ except per share data 2008-03-31 \
0 Revenue $4,135
1 Cost Of Goods Sold $3,179
2 Gross Profit $956
...
2007-12-31 2007-09-30 2007-06-30 2007-03-31 2006-12-31 2006-09-30 \
0 $5,672 $3,262 $2,886 $3,015 $3,986 $2,307
1 $4,501 $2,500 $2,185 $2,296 $3,135 $1,758
...
2006-06-30 2006-03-31 2005-12-31 2005-09-30 2005-06-30 2005-03-31
0 $2,139 $2,279 $2,977 $1,858 $1,753 $1,902
1 $1,630 $1,732 $2,309 $1,395 $1,303 $1,444
Answer 1 (score: 0)
Here is an alternative that is faster than Selenium and gives the headers as they are shown on the page.
import requests
from bs4 import BeautifulSoup as bs
import re
import json
import pandas as pd
r = requests.get('https://www.macrotrends.net/stocks/charts/AMZN/amazon/income-statement?freq=Q')
p = re.compile(r' var originalData = (.*?);\r\n\r\n\r',re.DOTALL)
data = json.loads(p.findall(r.text)[0])
headers = list(data[0].keys())
headers.remove('popup_icon')
result = []
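for row in data:
    # field_name holds an HTML fragment; name the parser explicitly to avoid bs4's warning
    soup = bs(row['field_name'], 'html.parser')
    field_name = soup.select_one('a, span').text
    # drop the first two values (assumed to be field_name and popup_icon)
    fields = list(row.values())[2:]
    fields.insert(0, field_name)
    result.append(fields)
df = pd.DataFrame(result, columns=headers)
# option_context only takes effect when used as a context manager
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df.head())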
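The extraction step above can be checked offline. The following is a minimal sketch that applies the same regular expression to an inline string instead of a live response. SAMPLE_PAGE is a hypothetical, shortened stand-in for the page source (the value format is an assumption), extract_rows is a hypothetical helper name, and the tags are stripped with a simple regex rather than BeautifulSoup so that only the standard library is needed.

```python
import json
import re

# Hypothetical, shortened stand-in for the page source (assumption, not real data).
SAMPLE_PAGE = (
    "<script>\r\n"
    " var originalData = ["
    '{"field_name":"<a>Revenue</a>","popup_icon":"<div></div>",'
    '"2008-03-31":"4135","2007-12-31":"5672"},'
    '{"field_name":"<span>Gross Profit</span>","popup_icon":"<div></div>",'
    '"2008-03-31":"956","2007-12-31":"1171"}'
    "];\r\n\r\n\r\n"
    "</script>"
)


def extract_rows(page_text):
    """Return (headers, rows) parsed from the embedded originalData array."""
    pattern = re.compile(r" var originalData = (.*?);\r\n\r\n\r", re.DOTALL)
    match = pattern.search(page_text)
    if match is None:
        raise ValueError("originalData not found in page text")
    data = json.loads(match.group(1))
    # Keep every key except the icon column, in the order the page provides.
    headers = [key for key in data[0] if key != "popup_icon"]
    rows = []
    for row in data:
        # Strip the markup around the label (a simplification of the bs4 step).
        label = re.sub(r"<[^>]+>", "", row["field_name"])
        values = [row[h] for h in headers if h != "field_name"]
        rows.append([label] + values)
    return headers, rows


headers, rows = extract_rows(SAMPLE_PAGE)
print(headers)
print(rows)
```

Looking values up by header name, rather than slicing off the first two, keeps the rows aligned with the headers even if the key order in the embedded JSON were to change.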