How to create a new DataFrame

Asked: 2019-07-19 10:54:40

Tags: python scikit-learn

ValueError: Shape of passed values is (1000, 10), indices imply (1000, 11)

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.columns)

2 answers:

Answer 0 (score: 0)

The error

Shape of passed values is (1000, 10), indices imply (1000, 11)

occurs on this line

df_feat = pd.DataFrame(scaled_features,columns=df.columns)

because scaled_features has 10 columns, but df.columns has length 11.
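To make the mismatch concrete, here is a minimal sketch that reproduces it without scikit-learn. The 5-row frame and the column names f0..f9 are made up for illustration; only 'TARGET CLASS' comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 10 feature columns plus the target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 10)), columns=[f"f{i}" for i in range(10)])
df["TARGET CLASS"] = [1, 1, 0, 0, 1]

features = df.drop("TARGET CLASS", axis=1)
values = features.to_numpy()

print(values.shape)     # (5, 10)
print(len(df.columns))  # 11

# 10 columns of values with 11 labels: pandas refuses to build the frame.
try:
    pd.DataFrame(values, columns=df.columns)
    mismatch_raised = False
except ValueError as exc:
    mismatch_raised = True
    print(exc)

# Using the labels of the frame the values came from works.
fixed = pd.DataFrame(values, columns=features.columns)
```

Comparing values.shape[1] with len(columns) before building the frame is a quick way to spot this kind of problem.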


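Note that both calls to df.drop('TARGET CLASS',axis=1) return a copy of df without the TARGET CLASS column; df itself is not modified, so df.columns still has all 11 labels. You need to leave that extra column out of the new column list as well.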

This can be fixed by saving a reference to the result of df.drop('TARGET CLASS',axis=1) (let's call it df_minus_target) and passing df_minus_target.columns as the new column list:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_minus_target = df.drop('TARGET CLASS',axis=1)
scaler.fit(df_minus_target)
scaled_features = scaler.transform(df_minus_target)
df_feat = pd.DataFrame(scaled_features,columns=df_minus_target.columns)
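A slightly shorter variant of the same idea, as a sketch only: fit_transform does the fit and the transform in one call, and passing index= keeps the original row labels, which matters if df does not use the default 0..n-1 index. The data, the column names a, b, c and the index values here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a non-default index, to show that row labels survive.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((5, 3)),
    columns=["a", "b", "c"],
    index=[10, 11, 12, 13, 14],
)
df["TARGET CLASS"] = [1, 1, 0, 0, 1]

df_minus_target = df.drop("TARGET CLASS", axis=1)

# Equivalent to calling fit() and then transform() on the same data.
scaled_features = StandardScaler().fit_transform(df_minus_target)

df_feat = pd.DataFrame(
    scaled_features,
    columns=df_minus_target.columns,
    index=df_minus_target.index,
)
print(df_feat)
```

Without index=, df_feat would get a fresh 0..4 index and would no longer line up with df['TARGET CLASS'] if you later join the two.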

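Answer 1 (score: 0)

When extracting the columns to create the df_feat DataFrame, you forgot to drop the last column from df. The call should be pd.DataFrame(scaled_features,columns=df.drop('TARGET CLASS',axis=1).columns). See the full reproducible example below: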

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)

scaler = StandardScaler()
scaler.fit(df.drop('TARGET CLASS',axis=1))
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
df_feat = pd.DataFrame(scaled_features,columns=df.drop('TARGET CLASS',axis=1).columns)

print(df_feat)

Or, to prevent this kind of error in the future, first extract the feature columns you want to process into a separate DataFrame:

import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

# Mock your dataset:
df = pd.DataFrame(np.random.rand(5, 10))
df = pd.concat([df, pd.Series([1, 1, 0, 0, 1], name='TARGET CLASS')], axis=1)

# Extract raw features columns first.
df_feat = df.drop('TARGET CLASS', axis=1)

# Do transformations.
scaler = StandardScaler()
scaler.fit(df_feat)
scaled_features = scaler.transform(df_feat)
df_feat_scaled = pd.DataFrame(scaled_features, columns=df_feat.columns)

print(df_feat_scaled)
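If you do this in several places, the same steps could be wrapped in a small function. This is only a sketch: scale_features is a hypothetical helper name, not something pandas or scikit-learn provides, and the mock column names f0..f9 are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_features(df, target):
    """Hypothetical helper: return the standardized feature columns of df."""
    features = df.drop(target, axis=1)
    scaled = StandardScaler().fit_transform(features)
    # Labels are taken from `features`, so their count always matches `scaled`.
    return pd.DataFrame(scaled, columns=features.columns, index=features.index)


# Mock dataset: 10 feature columns plus the target column.
df = pd.DataFrame(np.random.rand(5, 10), columns=[f"f{i}" for i in range(10)])
df["TARGET CLASS"] = [1, 1, 0, 0, 1]

df_feat_scaled = scale_features(df, "TARGET CLASS")
print(df_feat_scaled.shape)  # (5, 10)
```

Because the column labels and the scaled values come from the same intermediate frame, the shapes cannot drift apart the way they did in the question.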