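I am looking for a vectorized way to create a numpy 2D array in which each row contains 64 days of data, extracted with a sliding window from a pandas Series that holds more than 6000 days of data.
The window size is 64 and the stride is 1.
Below are two solutions, one with a simple loop and one with a list comprehension based on Ingrid's answer: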
import time
import numpy as np
import pandas as pd

# Set up a dataframe with 6000 random samples
df = pd.DataFrame(np.random.rand(6000),columns=['d_ret'])
days_of_data = df['d_ret'].count()
n_D = 64 # Window size
# The dataset will have m = (days_of_data - n_D + 1) rows
m = days_of_data - n_D + 1
# Build the dataset with a loop
t = time.time() # Start timing
X = np.zeros((m,n_D)) # Initialize np array
for day in range(m): # Loop over the window start positions 0 .. m-1
X[day][:] = df['d_ret'][day:day+n_D].values # Copy content of sliding window into array
elapsed = time.time() - t # Stop timing
print("X.shape\t: {}".format(X.shape))
print("Elapsed time\t: {}".format(elapsed))
t = time.time() # Start timing
X1 = [df.loc[ind: ind+n_D-1, 'd_ret'].values for ind, _ in df.iterrows()]
X2 = [lst for lst in X1 if len(lst) == n_D]
X_np = np.array(X2) # Get np array as output
elapsed = time.time() - t # Stop timing
print("X_np.shape\t: {}".format(X_np.shape))
print("Elapsed time\t: {}".format(elapsed))
Output
X.shape : (5937, 64)
Elapsed time : 0.37702155113220215
X_np.shape : (5937, 64)
Elapsed time : 0.7020401954650879
How can this be vectorized?
Sample input/output
# Input
Input = pd.Series(range(128))
# Output
array([[ 0., 1., 2., ..., 61., 62., 63.],
[ 1., 2., 3., ..., 62., 63., 64.],
[ 2., 3., 4., ..., 63., 64., 65.],
...,
[ 62., 63., 64., ..., 123., 124., 125.],
[ 63., 64., 65., ..., 124., 125., 126.],
[ 64., 65., 66., ..., 125., 126., 127.]])
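Answer 0 (score: 0)
You can use reshape (note that this gives consecutive, non-overlapping blocks of 64 values rather than stride-1 windows, and the length must be a multiple of 64):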
df.d_ret.values.reshape(-1, 64)
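To illustrate how reshape differs from a stride-1 sliding window, here is a minimal sketch on a toy array. The names `values`, `window`, `blocks` and `windows` are hypothetical and chosen only for this illustration.

```python
import numpy as np

values = np.arange(12)   # hypothetical toy series with 12 samples
window = 4               # hypothetical window size

# reshape cuts the data into consecutive, non-overlapping chunks
blocks = values.reshape(-1, window)
print(blocks.shape)      # (3, 4)

# a stride-1 sliding window has len(values) - window + 1 rows
n_rows = len(values) - window + 1
windows = np.array([values[i:i + window] for i in range(n_rows)])
print(windows.shape)     # (9, 4)

# reshape raises when the length is not a multiple of the window size,
# which is the case for 6000 samples and a window of 64
try:
    np.arange(6000).reshape(-1, 64)
except ValueError as err:
    print("reshape failed:", err)
```

So reshape alone does not reproduce the expected output from the question, where consecutive rows overlap by 63 values.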
Answer 1 (score: 0)
Maybe not fully vectorized, but list comprehensions in Python are quite efficient compared to for loops.
Assuming df has the format
>>> df.head()
   d_ret
0      0
1      1
2      2
3      3
4      4
can't you just do
X = [df.loc[ind: ind+n_D-1, 'd_ret'].values for ind, _ in df.iterrows()]
and then keep only the windows whose length is n_D
X1 = [lst for lst in X if len(lst) == n_D]
Then I get, for example:
>>> print(X1[2])
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65]
and np.array(X1).shape is
>>> np.array(X1).shape
(937, 64)
(937, 64) = (1000 - 64 + 1, 64) = (df['d_ret'].count() - n_D + 1, n_D)
Let me know if this is what you want :)
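The shape arithmetic above can be checked with a short sketch. It assumes a 1000-row frame with a default integer index, which is what the (937, 64) result implies but is not stated explicitly. Note that `.loc` slicing is label-based and includes the end label, which is why the slice ends at `ind + n_D - 1`.

```python
import numpy as np
import pandas as pd

n_D = 64
# Assumption: 1000 rows with a default RangeIndex, as implied by the shape above
df = pd.DataFrame({'d_ret': np.arange(1000)})

# One slice per start label; slices near the end of the frame come out shorter
X = [df.loc[ind: ind + n_D - 1, 'd_ret'].values for ind in df.index]
print(len(X), len(X[0]), len(X[-1]))   # 1000 64 1

# Keep only the full-length windows
X1 = [lst for lst in X if len(lst) == n_D]
arr = np.array(X1)
print(arr.shape)                       # (937, 64)
```

The last n_D - 1 = 63 start positions cannot supply a full window, so they are dropped by the length filter.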
Answer 2 (score: 0)
The fastest vectorized solution from "Numpy Vectorization of sliding-window operation" uses the following key lines:
idx = np.arange(m)[:,None] + np.arange(n_D)
out = df.values[idx].squeeze()
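For readers wondering how the index array works, here is a minimal sketch on toy data. The names `data`, `starts` and `offsets` are hypothetical and used only for illustration. Adding a column vector of window start positions to a row vector of in-window offsets broadcasts to a 2D array with idx[i, j] = i + j, and indexing with that integer array gathers one window per row.

```python
import numpy as np

data = np.array([10, 11, 12, 13, 14, 15])  # hypothetical toy data
n_D = 3                                    # window size
m = len(data) - n_D + 1                    # number of windows: 4

starts = np.arange(m)[:, None]   # shape (4, 1): start index of each window
offsets = np.arange(n_D)         # shape (3,):  offsets inside a window
idx = starts + offsets           # broadcasts to shape (4, 3)
print(idx)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]

out = data[idx]                  # integer-array indexing: out[i, j] = data[i + j]
print(out)
# [[10 11 12]
#  [11 12 13]
#  [12 13 14]
#  [13 14 15]]
```

With a single-column DataFrame, `df.values` has shape (N, 1), so `df.values[idx]` has shape (m, n_D, 1), and `squeeze()` drops the trailing axis of length 1.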
Applied to my example here:
import time
import numpy as np
import pandas as pd

# Set up a dataframe with 6000 random samples
df = pd.DataFrame(np.random.rand(6000),columns=['d_ret'])
days_of_data = df['d_ret'].count()
n_D = 64 # Window size
# The dataset will have m = (days_of_data - n_D + 1) rows
m = days_of_data - n_D + 1
t = time.time() # Start timing
# This line creates an array of indices that is then used to access
# the df.values numpy array. I do not understand how this works...
idx = np.arange(m)[:,None] + np.arange(n_D) # Don't understand this
out = df.values[idx].squeeze() # Remove an extra dimension
elapsed = time.time() - t # Stop timing
print("out.shape\t: {}".format(out.shape))
print("Elapsed time\t: {}".format(elapsed))
Output
out.shape : (5937, 64)
Elapsed time : 0.003000020980834961
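As an alternative that is not part of the answers above, newer NumPy versions (1.20 and later) provide `numpy.lib.stride_tricks.sliding_window_view`, which builds the same windows without creating an index array. The following is a minimal sketch that assumes such a NumPy version is installed; the variable `values` stands in for `df['d_ret'].values`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

n_D = 64                          # window size
values = np.random.rand(6000)     # stand-in for df['d_ret'].values

# Returns a read-only view on the original data, one window per row
out = sliding_window_view(values, n_D)
print(out.shape)                  # (5937, 64)

# Make an independent, writable array if the windows need to be modified
out_copy = out.copy()
```

Because the result is a view, no window data is copied until `copy()` is called, which can matter for long series.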