How to resolve a socket error when accessing data from Wikipedia using Python

Time: 2017-08-30 06:16:46

Tags: python pandas sockets csv pickle

I am trying to access a dataset from Wikipedia using Python. The purpose of the code is to read the table of S&P 500 companies and extract the data into CSV files (one CSV file per company). Some of the data is fetched fine, but I get a socket exception that I find hard to understand. My complete code is below:

import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)

    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    return tickers

#save_sp500_tickers()


def get_data_from_yahoo(reload_sp500=False):

    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)

    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2000, 1, 1)
    end = dt.datetime(2016, 12, 31)

    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, "yahoo", start, end)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))

get_data_from_yahoo()

I got an exception like the following:

Traceback (most recent call last):
  File "C:\Users\Jeet Chatterjee\Data Analysis With Python for finance\op6.py", line 49, in <module>
    get_data_from_yahoo()
  File "C:\Users\Jeet Chatterjee\Data Analysis With Python for finance\op6.py", line 44, in get_data_from_yahoo
    df = web.DataReader(ticker, "yahoo", start, end)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\data.py", line 121, in DataReader
    session=session).read()
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\yahoo\daily.py", line 115, in read
    df = super(YahooDailyReader, self).read()
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 181, in read
    params=self._get_params(self.symbols))
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 79, in _read_one_data
    out = self._read_url_as_StringIO(url, params=params)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 90, in _read_url_as_StringIO
    response = self._get_response(url, params=params)
  File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas_datareader\base.py", line 139, in _get_response
    raise RemoteDataError('Unable to read URL: {0}'.format(url))
pandas_datareader._utils.RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/AGN?period1=946665000&period2=1483208999&interval=1d&events=history&crumb=6JtBOAj%5Cu002F6EP

Please help me resolve this issue. Thanks in advance.

1 Answer:

Answer 0 (score: 1)

There isn't much wrong with what you've done. One issue is that Yahoo's time series data isn't guaranteed to be available 100% of the time; it genuinely comes and goes. I just looked at the Yahoo site: Allergan (AGN) seems fine, but Brown-Forman (BF.B) and Berkshire Hathaway class B (BRK.B) failed when I tried them.
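Since these failures are often transient, one option (a sketch of my own, not part of the original answer; fetch_with_retries and its parameters are hypothetical names) is to retry each download a couple of times before giving up:

import time
import pandas_datareader.data as web
from pandas_datareader._utils import RemoteDataError

def fetch_with_retries(ticker, start, end, attempts=3, pause=1.0):
    # Hypothetical helper: retry a Yahoo download a few times, since
    # transient outages often clear up between attempts.
    for _ in range(attempts):
        try:
            return web.DataReader(ticker, "yahoo", start, end)
        except RemoteDataError:
            time.sleep(pause)  # back off briefly before the next attempt
    return None  # all attempts failed; caller decides what to do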

Another issue is that you can't assume every symbol in the S&P 500 has time series data covering your hardcoded range; some of them have only existed since 2017.
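As an aside, and this is an assumption on my part rather than something verified in the answer, symbols like BRK.B and BF.B can also fail simply because Yahoo spells class shares with a dash (BRK-B), and the ticker text scraped from Wikipedia may carry a trailing newline. A small normalization step may recover some of those symbols:

def normalize_ticker(raw):
    # Assumed convention: strip scraping whitespace and map '.' to '-',
    # since Yahoo lists class shares as e.g. BRK-B rather than BRK.B.
    return raw.strip().replace('.', '-')

print(normalize_ticker('BRK.B\n'))  # -> BRK-B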

Below is a slightly modified version of your code. It makes a best effort to fetch every symbol, requesting data from January 1, 2000 up to the current day, and gives up on a symbol if Yahoo has no data available.

At the time of writing, this fetched time series for 503 of the 505 symbols currently in the S&P 500. Note that I used a proxy server; you can remove or comment out that part of the code.

import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests

# proxy servers for internet connection
proxies = {
    'http': 'http://my.proxy.server:8080',
    'https': 'https://my.proxy.server:8080',
}

symbol_filename = "sp500tickers.pickle"

def save_sp500_tickers():    
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies', proxies=proxies)
    soup = bs.BeautifulSoup(resp.text,  'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open(symbol_filename,"wb") as f:
        pickle.dump(tickers,f)
    return tickers


def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500 or not os.path.exists(symbol_filename):
        tickers = save_sp500_tickers()
    else:
        with open(symbol_filename,"rb") as f:
            tickers = pickle.load(f)

    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2000, 1, 1)
    end = dt.datetime(dt.date.today().year, dt.date.today().month, dt.date.today().day) 

    for ticker in tickers:
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            try:
                print(ticker)
                df = web.DataReader(ticker, "yahoo", start, end)
                df.to_csv('stock_dfs/{}.csv'.format(ticker))
            except Exception:
                print("No timeseries available for " + ticker)
        else:
            pass # print('Already have {}'.format(ticker))


os.environ["HTTP_PROXY"]=proxies['http']
os.environ["HTTPS_PROXY"]=proxies['https']
get_data_from_yahoo()
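To sanity-check a run, you could read one of the saved files back. This snippet is my own addition, and 'AAPL' is just a placeholder for any ticker that downloaded successfully (the Yahoo reader writes its index under a 'Date' column):

import pandas as pd

df = pd.read_csv('stock_dfs/AAPL.csv', index_col='Date', parse_dates=True)
print(df.tail())  # most recent rows of the saved time series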

Hope this helps.