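New to pandas, looking for the most efficient way to do this.
I have a Series of DataFrames. Each DataFrame has the same columns but a different index, and each is indexed by date. The Series itself is indexed by ticker symbol, so each item in the Series is a single time series of one stock's performance.
I need to randomly generate a list of n DataFrames, where each DataFrame is a subset of some random assortment of the available stocks' histories. Overlap is fine, as long as the start and end dates differ.
The following code does this, but it is really slow, and I'm wondering if there is a better way to go about it:
Code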
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if type(data) != pd.Series:
        return None

    if subset=='validate':
        offset = 0
    elif subset=='test':
        offset = 200
    elif subset=='train':
        offset = 400

    tickers = np.random.randint(0, len(data), size=len(data))

    ret_data = []
    while len(ret_data) != batch_size:
        for t in tickers:
            data_t = data[t]
            max_len = len(data_t)-timesteps-1
            if len(ret_data)==batch_size: break
            if max_len-offset < 0: continue

            index = np.random.randint(offset, max_len)
            d = data_t[index:index+timesteps]
            if len(d)==timesteps: ret_data.append(d)

    return ret_data
Profiler output:
Timer unit: 1e-06 s

File: finance.py
Function: random_sample at line 137
Total time: 0.016142 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           @profile
   138                                           def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
   139         1            5      5.0      0.0      if type(data) != pd.Series:
   140                                                   return None
   141
   142         1            1      1.0      0.0      if subset=='validate':
   143                                                   offset = 0
   144         1            1      1.0      0.0      elif subset=='test':
   145                                                   offset = 200
   146         1            0      0.0      0.0      elif subset=='train':
   147         1            1      1.0      0.0          offset = 400
   148
   149         1         1835   1835.0     11.4      tickers = np.random.randint(0, len(data), size=len(data))
   150
   151         1            2      2.0      0.0      ret_data = []
   152         2            3      1.5      0.0      while len(ret_data) != batch_size:
   153       116          148      1.3      0.9          for t in tickers:
   154       116         2497     21.5     15.5              data_t = data[t]
   155       116          317      2.7      2.0              max_len = len(data_t)-timesteps-1
   156       116           80      0.7      0.5              if len(ret_data)==batch_size: break
   157       115           69      0.6      0.4              if max_len-offset < 0: continue
   158
   159       100          101      1.0      0.6              index = np.random.randint(offset, max_len)
   160       100        10840    108.4     67.2              d = data_t[index:index+timesteps]
   161       100          241      2.4      1.5              if len(d)==timesteps: ret_data.append(d)
   162
   163         1            1      1.0      0.0      return ret_data
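Answer 0 (score: 1)
Are you sure you need a faster method? Your current method isn't that slow: the profile shows a total of about 16 ms. The following changes might simplify it, but won't necessarily make it any faster:
Step 1: Draw a random sample (with replacement) from the list of DataFrames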
rand_stocks = np.random.randint(0, len(data), size=batch_size)
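You can think of this array, rand_stocks, as a list of positions to apply to your Series of DataFrames. Its size is already the batch size, which eliminates the need for the while loop and the comparison on line 156.
That is, you can iterate over rand_stocks and access each stock like so: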
for idx in rand_stocks:
    stock = data.iloc[idx]  # positional lookup; .ix has been removed from pandas
    # Get a sample from this stock.
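As a minimal sketch of Step 1, the snippet below draws positions with replacement and uses them to pick items. The list of ticker strings is only a hypothetical stand-in for the real collection of DataFrames, and the seeded generator `rng` is an assumption added for reproducibility, not part of the original code.

```python
import numpy as np

# Hypothetical stand-in for the collection of per-ticker time series.
data = ["AAA", "BBB", "CCC", "DDD"]
batch_size = 6  # larger than len(data), so some items must repeat

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# integers() excludes the upper bound, so every value is a valid position.
rand_stocks = rng.integers(0, len(data), size=batch_size)

# One pick per requested sample; duplicates are expected (with replacement).
picked = [data[idx] for idx in rand_stocks]
```

Because exactly batch_size positions are drawn up front, no outer loop is needed to top up the result.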
Step 2: Get a random range of data for each randomly chosen stock.
start_idx = np.random.randint(offset, len(stock)-timesteps)
d = stock[start_idx:start_idx+timesteps]
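Step 2 can be sketched as a small helper, assuming each stock is a pandas object that supports positional slicing with .iloc. The name `random_window` and the `rng` argument are hypothetical. The sketch also makes two choices the snippet above does not: it returns None when the series is too short, and it adds 1 to the upper bound so the final window can be chosen (the upper bound of the random draw is exclusive).

```python
import numpy as np
import pandas as pd

def random_window(stock, timesteps, offset=0, rng=None):
    """Return a random contiguous slice of `timesteps` rows starting at or after `offset`."""
    rng = np.random.default_rng() if rng is None else rng
    last_start = len(stock) - timesteps  # latest start that still fits a full window
    if last_start < offset:
        return None  # not enough rows after the offset
    # The upper bound is exclusive, hence the +1 to allow the final window.
    start_idx = int(rng.integers(offset, last_start + 1))
    return stock.iloc[start_idx:start_idx + timesteps]

# Toy data, made up for illustration only.
dates = pd.date_range("2012-01-02", periods=50, freq="D")
stock = pd.DataFrame({"close": np.arange(50.0)}, index=dates)
window = random_window(stock, timesteps=10, offset=5, rng=np.random.default_rng(1))
```

Every window returned this way has exactly timesteps rows, so the length check from the original loop is no longer needed.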
I don't have your data, but here is how I would put it together:
def random_sample(data=None, timesteps=100, batch_size=100, subset='train'):
    if subset=='train': offset = 0  #you can obviously change this back

    rand_stocks = np.random.randint(0, len(data), size=batch_size)
    ret_data = []
    for idx in rand_stocks:
        stock = data[idx]
        start_idx = np.random.randint(offset, len(stock)-timesteps)
        d = stock[start_idx:start_idx+timesteps]
        ret_data.append(d)
    return ret_data
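For completeness, here is a hedged sketch that restores the three offsets from the question (0, 200 and 400) and skips series that are too short for the chosen subset. The names `OFFSETS`, `random_sample_sketch` and `rng` are hypothetical, the toy data is made up, and filtering out short series before drawing is my own assumption about the desired behaviour rather than something the original code does.

```python
import numpy as np
import pandas as pd

# Offsets taken from the question's code.
OFFSETS = {"validate": 0, "test": 200, "train": 400}

def random_sample_sketch(data, timesteps=100, batch_size=100, subset="train", rng=None):
    """Return `batch_size` random windows of `timesteps` rows each."""
    rng = np.random.default_rng() if rng is None else rng
    offset = OFFSETS[subset]
    # list() yields the values for both a plain list and a pandas Series.
    eligible = [s for s in list(data) if len(s) - timesteps >= offset]
    if not eligible:
        raise ValueError("no series is long enough for this subset")
    samples = []
    # Draw stocks with replacement, then one random window per draw.
    for idx in rng.integers(0, len(eligible), size=batch_size):
        stock = eligible[idx]
        start = int(rng.integers(offset, len(stock) - timesteps + 1))
        samples.append(stock.iloc[start:start + timesteps])
    return samples

# Toy data: three price histories of different lengths.
def make_history(n):
    idx = pd.date_range("2012-01-01", periods=n, freq="D")
    return pd.DataFrame({"close": np.arange(float(n))}, index=idx)

data = [make_history(600), make_history(450), make_history(100)]
batch = random_sample_sketch(data, timesteps=50, batch_size=8,
                             subset="train", rng=np.random.default_rng(0))
```

Filtering first means the loop always runs exactly batch_size times, instead of retrying until enough valid samples have been collected.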
Creating the dataset:
In [22]: import numpy as np
In [23]: import pandas as pd
In [24]: rndrange = pd.date_range('1/1/2012', periods=72, freq='B')
In [25]: rndseries = pd.Series(np.random.randn(len(rndrange)), index=rndrange)
In [26]: rndseries.head()
Out[26]:
2012-01-02 2.025795
2012-01-03 1.731667
2012-01-04 0.092725
2012-01-05 -0.489804
2012-01-06 -0.090041
In [27]: data = [rndseries,rndseries,rndseries,rndseries,rndseries,rndseries]
Testing the function:
In [42]: random_sample(data, timesteps=2, batch_size = 2)
Out[42]:
[2012-01-23 1.464576
2012-01-24 -1.052048,
2012-01-23 1.464576
2012-01-24 -1.052048]