I am downloading S&P 500 price data from Yahoo, and the volume figures are too large for a 32-bit integer.
def yahoo_prices(ticker, start_date=None, end_date=None, data='d'):
    csv = yahoo_historical_data(ticker, start_date, end_date, data)
    d = [('date', np.datetime64),
         ('open', np.float64),
         ('high', np.float64),
         ('low', np.float64),
         ('close', np.float64),
         ('volume', np.int64),
         ('adj_close', np.float64)]
    return np.recfromcsv(csv, dtype=d)
Here is the error:
>>> sp500 = yahoo_prices('^GSPC')
Traceback (most recent call last):
File "<stdin>", line 108, in <module>
File "<stdin>", line 74, in yahoo_prices
File "/usr/local/lib/python2.6/dist-packages/numpy/lib/npyio.py", line 1812, in recfromcsv
output = genfromtxt(fname, **kwargs)
File "/usr/local/lib/python2.6/dist-packages/numpy/lib/npyio.py", line 1646, in genfromtxt
output = np.array(data, dtype=ddtype)
OverflowError: long int too large to convert to int
If I declare the dtype with int64, why do I still get this error? Does it mean the I/O function is not actually using my dtype sequence d?
=== Edit: sample CSV added ===
Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88
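A quick sanity check on the sample above, as a minimal sketch: the volume values are compared against the signed 32-bit and 64-bit limits. The constant names `INT32_MAX` and `INT64_MAX` are illustrative, not taken from numpy.

```python
# Signed integer limits (illustrative constant names).
INT32_MAX = 2**31 - 1   # 2147483647
INT64_MAX = 2**63 - 1

# Volume column from the sample CSV rows.
volumes = [4401570000, 3687720000, 3506510000]

for v in volumes:
    # Each volume exceeds the 32-bit limit but fits comfortably in 64 bits.
    print(v, v > INT32_MAX, v <= INT64_MAX)
```

So int64 is the right choice for the column, provided the dtype actually reaches the parser.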
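Answer 0 (score: 3)
I'm not certain, but I think you have found a bug in numpy. I filed it here.
As I said there, if you open npyio.py and, inside recfromcsv, change this line:
kwargs.update(dtype=kwargs.get('update', None),
to this:
kwargs.update(dtype=kwargs.get('dtype', None),
then the long integers work for me without any problem. (I did not check whether the datetimes are correct, which is what Joe writes about in his answer.) You may notice that your dates are not converted either. Below is the specific code that works. The contents of "test.csv" were copied and pasted from your sample CSV data.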
import numpy as np
d = [('date', np.datetime64),
('open', np.float64),
('high', np.float64),
('low', np.float64),
('close', np.float64),
('volume', np.int64),
('adj_close', np.float64)]
a = np.recfromcsv("test.csv", dtype=d)
print(a)
[ (datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1329.19, 1343.32, 1329.19, 1342.84, 4401570000, 1342.84)
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1314.88, 1333.68, 1314.14, 1329.1, 3687720000, 1329.1)
(datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1324.02, 1327.28, 1310.51, 1314.88, 3506510000, 1314.88)]
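I also "fixed" the datetime problem by using a native Python object for the date field, calling genfromtxt directly with the same keyword defaults that recfromcsv sets up. I don't know whether this is suitable for you.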
import datetime
import numpy as np
d = [('date', datetime.datetime),
('open', np.float64),
('high', np.float64),
('low', np.float64),
('close', np.float64),
('volume', np.int64),
('adj_close', np.float64)]
#a = np.recfromcsv("test.csv", dtype=d)
kwargs = {"dtype": d}
case_sensitive = kwargs.get('case_sensitive', "lower") or "lower"
names = kwargs.get('names', True)
kwargs.update(
delimiter=kwargs.get('delimiter', ",") or ",",
names=names,
case_sensitive=case_sensitive)
output = np.genfromtxt("test.csv", **kwargs)
output = output.view(np.recarray)
print(output)
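To see why the one-word change in recfromcsv matters, here is a minimal sketch that reproduces only the keyword handling, not numpy itself. The helper names `buggy_update` and `fixed_update` are hypothetical; they just mirror the two versions of the line quoted above.

```python
def buggy_update(**kwargs):
    # Looks up the wrong key ('update'), which is never passed,
    # so the caller's dtype is silently replaced with None.
    kwargs.update(dtype=kwargs.get('update', None))
    return kwargs


def fixed_update(**kwargs):
    # Looks up 'dtype', so the caller's value is preserved.
    kwargs.update(dtype=kwargs.get('dtype', None))
    return kwargs


requested = [('volume', 'int64')]
print(buggy_update(dtype=requested)['dtype'])
print(fixed_update(dtype=requested)['dtype'])
```

With the dtype replaced by None, genfromtxt falls back to guessing the column types, which would explain why the explicit int64 was ignored.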
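Answer 1 (score: 1)
You need to convert the date strings into actual dates. The date format in your dtype is ignored because the first column cannot be converted directly to a datetime.
numpy expects you to be explicit and refuses to guess the date format. (Edit: this used to be the case, but no longer is.)
It expects datetime objects. If you want the date/time format to be guessed from the string, have a look at dateutil.parser.
In any case, you need something like the following: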
from cStringIO import StringIO
import datetime as dt
import numpy as np
dat = """Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88"""
infile = StringIO(dat)
d = [('date', np.datetime64),
('open', np.float64),
('high', np.float64),
('low', np.float64),
('close', np.float64),
('volume', np.int64),
('adj_close', np.float64)]
def parse_date(item):
    # %m is the month; %M would be parsed as minutes
    return dt.datetime.strptime(item, '%Y-%m-%d')
data = np.recfromcsv(infile, converters={0:parse_date}, dtype=d)
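A note on the strptime format string in a converter like parse_date: `%m` is the month and `%M` is the minute, and the two are easy to mix up because both parse a date such as 2012-06-15 without raising an error. A small sketch using only the standard library:

```python
import datetime as dt

# '%m' reads the middle field as the month.
right = dt.datetime.strptime('2012-06-15', '%Y-%m-%d')

# '%M' reads it as minutes; the month silently defaults to January.
wrong = dt.datetime.strptime('2012-06-15', '%Y-%M-%d')

print(right)  # 2012-06-15 00:00:00
print(wrong)  # 2012-01-15 00:06:00
```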
However, this kind of thing is where pandas shines. Consider using something like the following:
from cStringIO import StringIO
import pandas
dat = """Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88"""
infile = StringIO(dat)
data = pandas.read_csv(infile, index_col=0, parse_dates=True)