Question

I am trying to do the following:

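import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv('a.csv')
scaler = MinMaxScaler()

df_copy = df.copy(deep=True)

for i in range(1, len(df)):
    # take a 10-row window starting at row i
    df_chunk = df_copy.iloc[i:i+10]
    # the scaler is refit on every window
    df_chunk = scaler.fit_transform(df_chunk)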

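So each df_chunk should be a scaled DataFrame.

The problem is that some of them are scaled incorrectly.

If you plot the scaled data points, a correctly scaled DataFrame looks like numbers spread across the range 0 to 1. But the DataFrames I get have two extremes: the first 80% of the frame sits around 0.9, while the rest sits around 0.1.

So it feels as if the first ~80% of the data was scaled twice by the scaler. I tried using a pandas deep copy to fix this, but it does not seem to help.

If you have any idea why this happens, I would really appreciate it.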

Answer 1

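I'm not quite sure why you want to apply the scaler to chunks of the data. If you are worried that the CSV might be too large, you can read it in chunks with read_csv and process those chunks.

Now to your problem. You are refitting the scaler on every chunk, which is why you get odd results. You either have to fit the scaler on the whole data, or use the partial_fit method to fit the data online.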
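To see why per-chunk refitting produces this, here is a minimal sketch on made-up data (the column name `x` and the values are assumptions, not taken from a.csv). Each call to fit_transform discards the previous minimum and maximum and learns new ones from the 10 rows it is given, so the same raw value maps to different scaled values depending on which window it falls in.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 20 rows, one column 'x' with values 0..19.
df = pd.DataFrame({"x": np.arange(20, dtype=float)})

scaler = MinMaxScaler()

# Refit on two overlapping 10-row windows, as in the question's loop.
window_a = scaler.fit_transform(df.iloc[0:10])  # min/max learned from rows 0..9
window_b = scaler.fit_transform(df.iloc[5:15])  # min/max relearned from rows 5..14

# Row 5 (raw value 5.0) appears in both windows but is scaled differently.
row5_in_a = window_a[5, 0]  # (5 - 0) / (9 - 0)
row5_in_b = window_b[0, 0]  # (5 - 5) / (14 - 5)

# Fitting once on the whole frame gives each row a single, consistent value.
global_scaled = MinMaxScaler().fit_transform(df)
row5_global = global_scaled[5, 0]  # (5 - 0) / (19 - 0)

print(row5_in_a, row5_in_b, row5_global)
```

The deep copy does not change any of this: with its default settings fit_transform returns a new array rather than modifying the input, so the odd values come from the refitting, not from the data being scaled twice.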

I'll give you both solutions.

Solution 1: read and fit the whole data

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df = pd.read_csv('a.csv')
df_scaled = scaler.fit_transform(df)
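Note that with default settings fit_transform returns a NumPy array, not a DataFrame, so the column names and index are dropped. If you want them back, something like the sketch below may help. It assumes every column is numeric, and it uses an in-memory stand-in for a.csv with made-up columns `a` and `b`.

```python
import io
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for 'a.csv'; the columns 'a' and 'b' are made up for illustration.
csv_text = "a,b\n1,10\n2,20\n3,40\n"
df = pd.read_csv(io.StringIO(csv_text))

scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)  # plain NumPy array, labels are dropped

# Rebuild a DataFrame with the original column names and index.
df_scaled = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(df_scaled)
```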

Solution 2: read the csv by chunks and fit online

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# first read the csv by chunks and update the scaler
for chunk in pd.read_csv('a.csv', chunksize=10):
    scaler.partial_fit(chunk)

# read the csv again by chunks to transform the chunks
for chunk in pd.read_csv('a.csv', chunksize=10):
    transformed = scaler.transform(chunk)
    # not too sure what you want to do after this
    # but you can either print the results of the transformation
    # or write the transformed chunk to a new csv
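As a sanity check, the two-pass approach should learn the same minimum and maximum as a single fit on the whole file. The sketch below tries this on an in-memory stand-in for a.csv (25 rows, one made-up column `x`) and collects the transformed chunks into one DataFrame; writing each chunk to a new csv would work just as well.

```python
import io
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-in for 'a.csv': 25 rows, one made-up column 'x'.
csv_text = "x\n" + "\n".join(str(v) for v in range(25)) + "\n"

# Pass 1: update the running min/max chunk by chunk.
online = MinMaxScaler()
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    online.partial_fit(chunk)

# Reference: fit once on the whole file.
full = MinMaxScaler().fit(pd.read_csv(io.StringIO(csv_text)))

same_min = np.allclose(online.data_min_, full.data_min_)
same_max = np.allclose(online.data_max_, full.data_max_)

# Pass 2: transform each chunk with the now-fixed scaler and collect the results.
pieces = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    scaled = online.transform(chunk)
    pieces.append(pd.DataFrame(scaled, columns=chunk.columns, index=chunk.index))
df_scaled = pd.concat(pieces)

print(same_min, same_max, df_scaled["x"].min(), df_scaled["x"].max())
```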

Pandas deep copy and scikit-learn MinMaxScaler
