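I'm new to programming, Python and Pandas, so I hope this isn't a stupid question.
I downloaded some forex data from here. One month of data for all currency pairs comes to roughly 500,000 lines in CSV format.
Eventually I'd like to be able to test strategies across multiple time frames and instruments.
Here is the code I'm using: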
import pandas as pd

file_address = '/Users/Oliver/PyCharm/FX_app/test_data/EURUSD_test.csv'
# convert_string_to_datetime is the asker's own helper; its definition is not shown
df = pd.read_csv( file_address,
                  names       = ['Symbol', 'Date_Time', 'Bid', 'Ask'],
                  index_col   = 1,
                  parse_dates = True,
                  converters  = { 'Date_Time': convert_string_to_datetime }
                  )
# a non-PEP8 format
# didactic purpose
On anything other than a truncated test file, this read takes a very long time.
Any help is much appreciated.
Answer 0 (score: 0)
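I once worked with tick-by-tick data for cash equities (the top 30% most liquid stocks, more than 5 million records per day). Here is a strategy that uses chunksize and hdf5 to deal with the file-reading problem.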
import pandas as pd

# this is your FX file path
file_path = '/home/Jian/Downloads/EURUSD-2015-05.csv'
# read 10,000 rows per chunk; with chunksize, read_csv returns a lazy iterator,
# so this call itself returns almost immediately
file_reader = pd.read_csv(
    file_path,
    header=None,
    names=['Symbol', 'Date_time', 'Bid', 'Ask'],
    index_col=['Date_time'],
    parse_dates=['Date_time'],
    chunksize=10000,
)
# create your HDF5 at any path you like, with compression level 5 (0-9, 9 is extreme)
Jian_h5 = '/media/Primary Disk/Jian_Python_Data_Storage.h5'
h5_file = pd.HDFStore(Jian_h5, complevel=5, complib='blosc')
# then write all records into the HDF5 file
# this will take a while, but the stored file can be reused across IPython sessions
# count from 1 so each message names the chunk that was just written
for i, chunk in enumerate(file_reader, start=1):
    h5_file.append('fx_tick_data', chunk, complevel=5, complib='blosc')
    print('Writing Chunk no.{}'.format(i))
Writing Chunk no.1
Writing Chunk no.2
Writing Chunk no.3
Writing Chunk no.4
...
Writing Chunk no.424
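A side note on the reader above: it lets read_csv handle the timestamps through parse_dates instead of the per-row converters function used in the question. A converter is called once per row in Python, so it is a likely contributor to the slow read, though that has not been measured on this data. Below is a minimal sketch of the two approaches on made-up in-memory rows. The names slow_converter and stamp_format, and the timestamp layout itself, are assumptions for illustration and are not taken from the real file.

```python
import io
from datetime import datetime

import pandas as pd

# Tiny in-memory stand-in for the tick file. The timestamp layout is an
# assumption made for this sketch; check it against the real CSV.
csv_text = (
    "EUR/USD,2015-05-01 00:00:00.017,1.1211,1.1212\n"
    "EUR/USD,2015-05-01 00:00:00.079,1.1212,1.1212\n"
    "EUR/USD,2015-05-01 00:00:00.210,1.1212,1.1213\n"
)
columns = ['Symbol', 'Date_time', 'Bid', 'Ask']
stamp_format = '%Y-%m-%d %H:%M:%S.%f'  # assumed layout


def slow_converter(text):
    # Hypothetical per-row converter: runs once per row in pure Python.
    return datetime.strptime(text, stamp_format)


# Variant 1: per-row converter, in the style of the question.
df_converter = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=columns,
    index_col='Date_time',
    converters={'Date_time': slow_converter},
)

# Variant 2: read the column as text, then convert it in one vectorized call.
df_vectorized = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)
df_vectorized['Date_time'] = pd.to_datetime(
    df_vectorized['Date_time'], format=stamp_format
)
df_vectorized = df_vectorized.set_index('Date_time')

# Both variants should end up with the same timestamps in the index.
same_index = list(df_vectorized.index) == list(pd.to_datetime(df_converter.index))
print(same_index)
```

On a large file the second variant is usually much faster, because the whole column is converted in one call with a known format rather than row by row.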
# check your hdf5 file, all 4,237,535 records are there
h5_file
Out[8]:
<class 'pandas.io.pytables.HDFStore'>
File path: /media/Primary Disk/Jian_Python_Data_Storage.h5
/fx_tick_data frame_table (typ->appendable,nrows->4237535,ncols->3,indexers->[index])
# close file IO
h5_file.close()
# the advantage is that after you close your current session,
# you can still read the file very quickly when you open another one
# reopen your IPython session
Jian_h5 = '/media/Primary Disk/Jian_Python_Data_Storage.h5'
h5_file = pd.HDFStore(Jian_h5)
%time fx_df = h5_file['fx_tick_data']
CPU times: user 1.93 s, sys: 439 ms, total: 2.37 s
Wall time: 2.37 s
Out[12]:
                             Symbol     Bid     Ask
Date_time
2015-05-01 00:00:00.017000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:00.079000  EUR/USD  1.1212  1.1212
2015-05-01 00:00:00.210000  EUR/USD  1.1212  1.1213
2015-05-01 00:00:00.891000  EUR/USD  1.1212  1.1213
2015-05-01 00:00:05.179000  EUR/USD  1.1212  1.1213
2015-05-01 00:00:06.257000  EUR/USD  1.1212  1.1213
2015-05-01 00:00:09.195000  EUR/USD  1.1212  1.1213
2015-05-01 00:00:09.242000  EUR/USD  1.1212  1.1212
2015-05-01 00:00:09.257000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:09.311000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:09.538000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:14.177000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:14.238000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:15.886000  EUR/USD  1.1211  1.1212
2015-05-01 00:00:17.122000  EUR/USD  1.1211  1.1212
...                             ...     ...     ...
2015-05-31 23:59:45.054000  EUR/USD  1.0958  1.0959
2015-05-31 23:59:45.063000  EUR/USD  1.0958  1.0958
2015-05-31 23:59:45.065000  EUR/USD  1.0958  1.0958
2015-05-31 23:59:45.073000  EUR/USD  1.0958  1.0958
2015-05-31 23:59:45.076000  EUR/USD  1.0958  1.0958
2015-05-31 23:59:45.210000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:45.308000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:45.806000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:45.809000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:45.909000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:46.316000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:46.527000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:47.711000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:51.721000  EUR/USD  1.0957  1.0958
2015-05-31 23:59:57.063000  EUR/USD  1.0957  1.0958
[4237535 rows x 3 columns]
Not bad: in future sessions it takes only about 2 seconds to read the whole file from HDF5.
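Since the stated goal in the question is testing across multiple time frames, here is a hedged sketch of one way to derive bars from the loaded ticks with resample. The ticks frame holds made-up values shaped like fx_df, and bid_bars is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical tick frame shaped like fx_df (values are made up).
ticks = pd.DataFrame(
    {
        'Symbol': ['EUR/USD'] * 4,
        'Bid': [1.1211, 1.1212, 1.1210, 1.1213],
        'Ask': [1.1212, 1.1213, 1.1211, 1.1214],
    },
    index=pd.to_datetime([
        '2015-05-01 00:00:00.017',
        '2015-05-01 00:00:30.500',
        '2015-05-01 00:01:10.000',
        '2015-05-01 00:06:00.250',
    ]),
)
ticks.index.name = 'Date_time'


def bid_bars(frame, rule):
    """Resample Bid ticks into open/high/low/close bars for one time frame."""
    bars = frame['Bid'].resample(rule).ohlc()
    # Intervals without any tick come back as all-NaN rows; drop them.
    return bars.dropna()


# One tick DataFrame, several time frames.
bars_by_rule = {rule: bid_bars(ticks, rule) for rule in ('1min', '5min')}
print(bars_by_rule['1min'])
print(bars_by_rule['5min'])
```

Keeping the raw ticks in the store and building bars on demand means a new time frame only needs another rule string, not another pass over the CSV.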