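I am trying to scrape the last 5 years of Yahoo Finance historical data for a particular stock. I have written Python code that scrapes every row of the table containing the historical data. I know there are simpler ways to get historical data, but I want to do it by scraping. The problem is that Yahoo Finance uses infinite scrolling: once I reach the bottom of the page, more rows are added to the table dynamically. As a result, my code only gets the rows up to the end of the first page, not the full 5 years of data. Here is a sample of the code I am trying.
After navigating to the rows in the scraping part: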
tableRows = table.find_all('tr', class_='BdT Bdc($seperatorColor) Ta(end) Fz(s) Whs(nw)')
I then extract the values from these rows.
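The extraction step itself is not shown in the question. As a rough sketch, assuming each `tr` has already been reduced to a list of its cell texts (for example with `[td.get_text() for td in row.find_all('td')]`), a row could be turned into a record as below. The seven-column layout, the `Mar 05, 2021` style date format, and the helper names `to_number` and `parse_row` are assumptions for illustration; the question does not confirm any of them.

```python
from datetime import datetime

# Assumed column order after the date cell (hypothetical, adjust to the real table).
COLUMNS = ("open", "high", "low", "close", "adj_close", "volume")


def to_number(text):
    """Convert text such as '2,178.70' to a float; return None if it is not numeric."""
    cleaned = text.replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_row(cells, date_format="%b %d, %Y"):
    """Return a dict for a price row, or None for rows with a different shape."""
    # Rows without a date plus six values (e.g. dividend notices) are skipped.
    if len(cells) != len(COLUMNS) + 1:
        return None
    record = {"date": datetime.strptime(cells[0].strip(), date_format).date()}
    for name, text in zip(COLUMNS, cells[1:]):
        record[name] = to_number(text)
    return record
```

Rows for which `parse_row` returns `None` can simply be filtered out before building a table from the records.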
Answer 0 (score: 1)
You need to mimic user behaviour in the browser to get the rest of the results.
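One way to read "mimic user behaviour" is: keep scrolling until the table stops growing. The sketch below keeps that loop independent of any particular browser driver. `count_rows` and `scroll` are hypothetical callables that you would back with your own Selenium (or similar) calls, and the pause and patience values are arbitrary.

```python
import time


def scroll_until_stable(count_rows, scroll, pause=1.0, patience=3, max_rounds=500):
    """Call scroll() repeatedly until count_rows() stops increasing.

    count_rows: callable returning the number of rows currently loaded
    scroll:     callable that scrolls the page once (e.g. sends the END key)
    patience:   consecutive rounds without new rows before giving up
    Returns the final row count.
    """
    seen = count_rows()
    idle = 0
    for _ in range(max_rounds):
        scroll()
        time.sleep(pause)  # give the page time to append more rows
        current = count_rows()
        if current > seen:
            seen, idle = current, 0
        else:
            idle += 1
            if idle >= patience:
                break
    return seen
```

Once the function returns, the page source can be handed to the existing parsing code, which should then see all loaded rows.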
Answer 1 (score: 1)
I suggest you try the yfinance library (https://pypi.org/project/yfinance/):
import yfinance as yf
msft = yf.Ticker("MSFT")
# get stock info
msft.info
# get historical market data
hist = msft.history(period="max")
Answer 2 (score: 1)
Selenium is one way to do it. A more efficient way is to query the data directly:
import requests
import pandas as pd
import datetime
years = 5  # how far back to go
dt = datetime.datetime.now()
# same calendar day, `years` years earlier
past_date = datetime.datetime(year=dt.year-years, month=dt.month, day=dt.day)
url = 'https://query2.finance.yahoo.com/v8/finance/chart/RELIANCE.NS'
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}
payload = {
'formatted': 'true',
'crumb': 'J2oUJNHQwXU',
'lang': 'en-GB',
'region': 'GB',
'includeAdjustedClose': 'true',
'interval': '1d',
'period1': '%s' %int(past_date.timestamp()),
'period2': '%s' %int(dt.timestamp()),
'events': 'div|split',
'useYfid': 'true',
'corsDomain': 'uk.finance.yahoo.com'}
jsonData = requests.get(url, headers=headers, params=payload).json()
result = jsonData['chart']['result'][0]
indicators = result['indicators']
rows = {'timestamp':result['timestamp']}
rows.update(indicators['adjclose'][0])
rows.update(indicators['quote'][0])
df = pd.DataFrame(rows)
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
Output:
print(df)
timestamp adjclose ... open low
0 2016-03-08 03:45:00 492.139252 ... 499.019806 499.019806
1 2016-03-09 03:45:00 499.183502 ... 505.211090 504.517670
2 2016-03-10 03:45:00 484.831451 ... 516.132568 499.762756
3 2016-03-11 03:45:00 486.149292 ... 502.685059 500.555237
4 2016-03-14 03:45:00 488.665009 ... 504.765320 501.719208
... ... ... ... ...
1229 2021-03-01 03:45:00 2101.699951 ... 2110.199951 2062.500000
1230 2021-03-02 03:45:00 2106.000000 ... 2122.000000 2089.100098
1231 2021-03-03 03:45:00 2202.100098 ... 2121.050049 2107.199951
1232 2021-03-04 03:45:00 2175.850098 ... 2180.000000 2157.699951
1233 2021-03-05 09:59:59 2178.699951 ... 2156.000000 2153.050049
[1234 rows x 7 columns]
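The `period1` and `period2` values in the request are plain Unix timestamps. One detail worth noting: building `past_date` by subtracting from `dt.year` raises `ValueError` when the script runs on 29 February and the target year is not a leap year. A small sketch of a leap-day-safe variant follows. The function name `epoch_window`, the use of UTC, and the fallback to 28 February are my own assumptions, not part of the answer.

```python
from datetime import datetime, timezone


def epoch_window(years, now=None):
    """Return (period1, period2) as integer Unix timestamps covering `years` years up to `now`."""
    now = now or datetime.now(timezone.utc)
    try:
        start = now.replace(year=now.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        start = now.replace(year=now.year - years, day=28)
    return int(start.timestamp()), int(now.timestamp())
```

The two integers can then be supplied as `period1` and `period2` in the `payload` dict.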
Answer 3 (score: 0)
Plenty of better solutions have already been shown, but I will just show you how to do it by pressing the END key:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 locator API (the find_element_by_* helpers were removed)
TBODY_XPATH = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table/tbody'

driver = webdriver.Chrome()
driver.implicitly_wait(6)
driver.get("https://uk.finance.yahoo.com/quote/RELIANCE.NS/history?period1=1297987200&period2=1613606400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
# dismiss the cookie consent dialog
driver.find_element(By.XPATH, '//*[@id="consent-page"]/div/div/div/form/div[2]/div[2]/button').click()
history_table = driver.find_element(By.XPATH, TBODY_XPATH).find_elements(By.TAG_NAME, "tr")
# keep pressing END while the year of the last loaded row is >= 2020 - 5
while int(history_table[-1].find_elements(By.TAG_NAME, "td")[0].text.split()[2]) >= 2020 - 5:
    history_table = driver.find_element(By.XPATH, TBODY_XPATH).find_elements(By.TAG_NAME, "tr")
    ActionChains(driver).send_keys(Keys.END).perform()
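The loop's stop condition relies on the year being the third whitespace-separated token of the first cell and on the hard-coded year 2020. A slightly more explicit sketch, assuming the UK page renders dates like `05 Mar 2021` (inferred from the `split()[2]` in the answer, not confirmed), parses the cell and compares it with a cutoff derived from today's date. `needs_more_rows` is a hypothetical helper name.

```python
from datetime import date, datetime


def needs_more_rows(last_cell_text, years=5, today=None, date_format="%d %b %Y"):
    """Return True while the oldest loaded row's year is at or after the cutoff year."""
    today = today or date.today()
    cutoff_year = today.year - years
    oldest = datetime.strptime(last_cell_text.strip(), date_format).date()
    # Same comparison as the loop above: keep scrolling while year >= cutoff year.
    return oldest.year >= cutoff_year
```

The text of the first `td` in the last loaded row would be passed as `last_cell_text`; rows whose first cell is not a date would need to be handled separately.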