Question

I have the following code that builds sequences from a DataFrame loaded from a CSV of rain ratio data.

import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
seq_len  = 1100

def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    data = []
    data = np.array([data_raw[index: index + seq_len] for index in range(len(data_raw) - (seq_len+1))])
    print(data.shape)

df = pd.read_csv("data.csv",index_col = 0)
temp = df.copy()
temp = normalize_data(temp)  # normalize_data is defined elsewhere (not shown here)
load_data(temp, seq_len)

When I run load_data(temp, seq_len), I have to wait a very long time. I can't tell whether seq_len is the problem.

Here is the attached dataset: data.csv

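Please help me speed this up. There may be larger data in the future, but if this step becomes faster I won't have to worry about that.

**Edit:** Following @ParitoshSingh's comment, here is part of the dataset. Don't treat it as the full data; it is only a slice of a larger dataset: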

,rains_ratio_2013,rains_ratio_2014
0,1.12148,1.1216
1,1.12141,1.12162
2,1.12142,1.12163
3,1.12148,1.1216
4,1.12143,1.12165
5,1.12141,1.12161
6,1.1213799999999998,1.12161
7,1.1214,1.12158
8,1.1214,1.12158
9,1.12141,1.12158
10,1.12141,1.12161
11,1.12144,1.1215899999999999
12,1.12141,1.12162
13,1.12141,1.12161
14,1.12143,1.12161
15,1.12143,1.1216899999999999
16,1.12143,1.12173
17,1.12143,1.12178
18,1.1214600000000001,1.12179
19,1.12148,1.12174
20,1.12148,1.1217
21,1.12148,1.12174
22,1.12148,1.1217
23,1.12145,1.1217
24,1.12145,1.1217
25,1.12148,1.1217
26,1.1214899999999999,1.1217
27,1.1214899999999999,1.1216899999999999
28,1.12143,1.1216899999999999
29,1.12143,1.1216899999999999
30,1.12144,1.1216899999999999

Answer 1

This is essentially a sliding window problem.

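One approach is to use vectorization to slide the window over the data much faster. Note that if you don't have enough memory to hold the final output, this can cause problems of its own.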
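To make the memory caveat concrete, here is a rough sketch that estimates how many bytes the fully materialised output would need. The helper name `estimate_window_bytes` and the sizes used are hypothetical, chosen only for illustration; `itemsize=8` assumes float64 values.

```python
def estimate_window_bytes(n_rows, n_cols, seq_len, itemsize=8):
    """Bytes needed if every window is stored as its own copy."""
    n_windows = n_rows - seq_len + 1
    if n_windows <= 0:
        return 0
    # Each window holds seq_len rows of n_cols values.
    return n_windows * seq_len * n_cols * itemsize

# Hypothetical sizes: 100,000 rows, 2 float64 columns, windows of 1100 rows.
needed = estimate_window_bytes(100_000, 2, 1100)
print(f"{needed / 1e9:.2f} GB")  # 1.74 GB
```

The output grows roughly as rows times seq_len, so a large seq_len multiplies the footprint of the input many times over.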

import numpy as np
import pandas as pd

Create a dummy DataFrame to work with. You should test this on your original DataFrame.

seq_len = 5
df = pd.DataFrame(np.arange(300).reshape(-1, 3))
print(df.head())
#Output:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

Now we can build an array of all the indices we need, and use that index array to pull out all the values in the desired shape.
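The index array can be inspected on a tiny example first. This is a minimal sketch on toy data (6 rows, 2 columns, windows of 3 rows), not the asker's dataset; it also checks that the result matches slicing each window in a plain Python loop.

```python
import numpy as np

seq_len = 3
data_raw = np.arange(12).reshape(-1, 2)   # toy data: 6 rows, 2 columns
nrows = len(data_raw) - seq_len + 1       # 4 windows

# A column of window start positions plus a row of offsets:
# broadcasting yields one row of row-indices per window.
idx = np.arange(nrows)[:, None] + np.arange(seq_len)
print(idx)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]]

windows = data_raw[idx]    # integer-array indexing copies the selected rows
print(windows.shape)       # (4, 3, 2)

# Same result as building each window with a slice in a loop.
looped = np.array([data_raw[i:i + seq_len] for i in range(nrows)])
assert np.array_equal(windows, looped)
```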

def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    #find total number of rows
    nrows = len(data_raw) - seq_len + 1 # Your code had -(seq_len + 1); I am assuming that was a mistake. If not, adjust this accordingly.
    #Now, create an index matrix from the total number of rows.
    data = data_raw[np.arange(nrows)[:,None] + np.arange(seq_len)] 
    print("shape is", data.shape)
    return data

out = load_data(df, seq_len)
#Output: shape is (96, 5, 3)
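As a possible alternative (not part of the original answer), NumPy 1.20 and later provide `numpy.lib.stride_tricks.sliding_window_view`, which returns a read-only view instead of copying every window. The sketch below assumes that NumPy version is available; the function name `load_data_view` is made up for this example.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def load_data_view(df_, seq_len):
    data_raw = df_.values
    # Slide along the rows; the window axis is appended last,
    # giving shape (nrows, ncols, seq_len).
    view = sliding_window_view(data_raw, seq_len, axis=0)
    # Reorder to (nrows, seq_len, ncols) to match load_data.
    return view.transpose(0, 2, 1)

df = pd.DataFrame(np.arange(300).reshape(-1, 3))
out_view = load_data_view(df, 5)
print(out_view.shape)  # (96, 5, 3)
```

Because the result is a view, it costs almost no extra memory by itself, but any later step that copies it (for example `np.array(out_view)`) will still materialise the full array.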

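Edit: If you run into a memory error, you can always modify the function to use a generator. That way you take a middle ground between iterating over the windows one at a time and consuming too much memory at once.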

def load_data_gen(df_, seq_len, chunksize=10):
    data_raw = df_.values # convert to numpy array
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        data = data_raw[np.arange(i, min(i+chunksize, nrows))[:,None] + np.arange(seq_len)]
        print("shape is", data.shape)
        yield data

out = load_data_gen(df, seq_len, 15)
test = list(out)
#Output:
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (6, 5, 3)
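Note that `list(out)` collects every chunk back into memory, which is fine for a demo but gives up the memory savings. A minimal sketch of consuming the generator one chunk at a time follows; the per-window mean is only placeholder work standing in for whatever processing is actually needed, and the generator is repeated here (without the print call) so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

def load_data_gen(df_, seq_len, chunksize=10):
    # Same generator as above, minus the print call.
    data_raw = df_.values
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        starts = np.arange(i, min(i + chunksize, nrows))
        yield data_raw[starts[:, None] + np.arange(seq_len)]

df = pd.DataFrame(np.arange(300).reshape(-1, 3))

total_windows = 0
window_means = []
for chunk in load_data_gen(df, seq_len=5, chunksize=15):
    # Placeholder work: average each window over its rows -> (chunk_rows, ncols).
    window_means.append(chunk.mean(axis=1))
    total_windows += len(chunk)

window_means = np.concatenate(window_means)
print(total_windows, window_means.shape)  # 96 (96, 3)
```

Only one chunk of windows is alive at any time, so peak memory is governed by chunksize rather than by the total number of windows.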

Speeding up the generation of data sequences into arrays with Python
