Given the following pandas DataFrame containing 60 elements:
import pandas as pd
data = [60,62.75,73.28,75.77,70.28
,67.85,74.58,72.91,68.33,78.59
,75.58,78.93,74.61,85.3,84.63
,84.61,87.76,95.02,98.83,92.44
,84.8,89.51,90.25,93.82,86.64
,77.84,76.06,77.75,72.13,80.2
,79.05,76.11,80.28,76.38,73.3
,72.28,77,69.28,71.31,79.25
,75.11,73.16,78.91,84.78,85.17
,91.53,94.85,87.79,97.92,92.88
,91.92,88.32,81.49,88.67,91.46
,91.71,82.17,93.05,103.98,105]
data_pd = pd.DataFrame(data, columns=["price"])
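Is there a formula for rescaling the data so that every window running from index 0 to index i+1 that contains more than 20 elements is rescaled down to 20 elements?
Here is a loop that creates the windows to be rescaled. I just don't know of any method for doing the rescaling itself. Any suggestions on how to do this?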
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if i >= miniLenght:
        dataForScaling = data_pd[0:i]
        # do the scaling here so that the length of the rescaled data is always equal to miniLenght
        scaledDataToMinLenght = dataForScaling
        rescaledData.append(scaledDataToMinLenght)
Basically, after rescaling, rescaledData should contain 40 arrays, each of length 20.
Answer 0 (score: 3)
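From reading the paper, it looks like you are resizing the list back down to 20 indices and then interpolating the data at those 20 indices.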
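We'll build the indices the way they do (range(0, len(large), step = len(large)/miniLenght)) and then use numpy's interp. There are a million ways to interpolate data; np.interp uses linear interpolation, so if you ask for, say, index 1.5, you get the mean of points 1 and 2, and so on.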
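To make the behaviour of np.interp concrete, here is a minimal sketch on a made-up four-point series (the values are illustrative only and are not taken from the question's data). Asking for index 1.5 should give the mean of the values at indices 1 and 2, and sampling at evenly spaced positions shrinks the series:

```python
import numpy as np

# Illustrative values only (not from the question's data)
values = np.array([10.0, 20.0, 40.0, 80.0])
positions = np.arange(len(values))  # 0, 1, 2, 3

# Linear interpolation halfway between index 1 and index 2
halfway = np.interp(1.5, positions, values)
print(halfway)  # 30.0, the mean of 20.0 and 40.0

# Resample the four points down to two by sampling at positions 0 and 2
shrunk = np.interp(np.arange(0, 4, step=2), positions, values)
print(shrunk)
```

The same idea, applied with a fractional step, is what reduces each growing window to 20 points.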
So, here is a quick modification of your code (nb, we could probably vectorize this completely using 'rolling'):
import numpy as np
miniLenght = 20
rescaledData = []
for i in range(len(data_pd)):
    if i >= miniLenght:
        dataForScaling = data_pd['price'][0:i]
        # figure out how many 'steps' we have
        steps = len(dataForScaling)
        # make indices where the data needs to be sliced to get 20 points
        indices = np.arange(0, steps, step=steps / miniLenght)
        # use np.interp at those points, with the original values as given
        rescaledData.append(np.interp(indices, np.arange(steps), dataForScaling))
The output is as expected:
[array([ 60. , 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91,
68.33, 78.59, 75.58, 78.93, 74.61, 85.3 , 84.63, 84.61,
87.76, 95.02, 98.83, 92.44]),
array([ 60. , 63.2765, 73.529 , 74.9465, 69.794 , 69.5325,
74.079 , 71.307 , 72.434 , 77.2355, 77.255 , 76.554 ,
81.024 , 84.8645, 84.616 , 86.9725, 93.568 , 98.2585,
93.079 , 85.182 ]),.....
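One caveat with the loop above: np.arange with a non-integer step can occasionally return one element more or fewer than intended because of floating-point rounding, a point the NumPy documentation itself raises. A possible variant, sketched here on the assumption that the same evenly spaced sampling positions are wanted, builds them with np.linspace(..., endpoint=False), which always returns exactly the requested number of points. The helper name rescale_window and the parameter target_len are hypothetical, and the short series is just the first 25 prices from the question.

```python
import numpy as np

def rescale_window(window, target_len=20):
    """Linearly resample a 1-D sequence to target_len points (hypothetical helper)."""
    window = np.asarray(window, dtype=float)
    steps = len(window)
    # Same spacing as np.arange(0, steps, steps / target_len),
    # but linspace guarantees exactly target_len positions
    positions = np.linspace(0, steps, num=target_len, endpoint=False)
    return np.interp(positions, np.arange(steps), window)

# First 25 prices from the question
prices = [60, 62.75, 73.28, 75.77, 70.28, 67.85, 74.58, 72.91, 68.33, 78.59,
          75.58, 78.93, 74.61, 85.3, 84.63, 84.61, 87.76, 95.02, 98.83, 92.44,
          84.8, 89.51, 90.25, 93.82, 86.64]

# Windows prices[0:20], prices[0:21], ..., prices[0:24], as in the original loop
rescaled = [rescale_window(prices[:i]) for i in range(20, len(prices))]
print(len(rescaled), {len(r) for r in rescaled})
```

With the full 60-element column, the same list comprehension over range(20, 60) should give the 40 arrays of length 20 that the question asks for.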