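I have two CSV files (a training set and a test set). There are visible NaN values in a few of the columns (status, hedge_value, indicator_code, portfolio_id, desk_id, office_id).
I start the process by replacing the NaN values with some huge value corresponding to the column.
Then I do LabelEncoding to remove the text data and convert it into numerical data.
Now, when I try to do OneHotEncoding on the categorical data, I get an error. I tried passing the inputs one by one into the OneHotEncoding constructor, but I get the same error for every column.
Basically, my end goal is to predict the return values, but I am stuck in the data preprocessing part because of this. How can I solve this issue?
I am using Python 3.6 with Pandas and Sklearn for data processing.
Code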
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
test_data = pd.read_csv('test.csv')
train_data = pd.read_csv('train.csv')
# Replace NaN values with a per-column placeholder
train_data['status']=train_data['status'].fillna(2.0)
train_data['hedge_value']=train_data['hedge_value'].fillna(2.0)
train_data['indicator_code']=train_data['indicator_code'].fillna(2.0)
train_data['portfolio_id']=train_data['portfolio_id'].fillna('PF99999999')
train_data['desk_id']=train_data['desk_id'].fillna('DSK99999999')
train_data['office_id']=train_data['office_id'].fillna('OFF99999999')
x_train = train_data.iloc[:, :-1].values
y_train = train_data.iloc[:, 17].values
# =============================================================================
# from sklearn.preprocessing import Imputer
# imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
# imputer.fit(x_train[:, 15:17])
# x_train[:, 15:17] = imputer.fit_transform(x_train[:, 15:17])
#
# imputer.fit(x_train[:, 12:13])
# x_train[:, 12:13] = imputer.fit_transform(x_train[:, 12:13])
# =============================================================================
# Encoding categorical data, i.e. Text data, since calculation happens on numbers only, so having text like
# Country name, Purchased status will give trouble
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
x_train[:, 0] = labelencoder_X.fit_transform(x_train[:, 0])
x_train[:, 1] = labelencoder_X.fit_transform(x_train[:, 1])
x_train[:, 2] = labelencoder_X.fit_transform(x_train[:, 2])
x_train[:, 3] = labelencoder_X.fit_transform(x_train[:, 3])
x_train[:, 6] = labelencoder_X.fit_transform(x_train[:, 6])
x_train[:, 8] = labelencoder_X.fit_transform(x_train[:, 8])
x_train[:, 14] = labelencoder_X.fit_transform(x_train[:, 14])
# =============================================================================
# import numpy as np
# x_train[:, 3] = x_train[:, 3].reshape(x_train[:, 3].size,1)
# x_train[:, 3] = x_train[:, 3].astype(np.float64, copy=False)
# np.isnan(x_train[:, 3]).any()
# =============================================================================
# =============================================================================
# from sklearn.preprocessing import StandardScaler
# sc_X = StandardScaler()
# x_train = sc_X.fit_transform(x_train)
# =============================================================================
onehotencoder = OneHotEncoder(categorical_features=[0,1,2,3,6,8,14])
x_train = onehotencoder.fit_transform(x_train).toarray() # Replace Country Names with One Hot Encoding.
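Answer 0 (score: 7)
I went through the dataset again after posting the question and found more columns containing NaN. I can't believe I wasted so much time searching for NaN by eye when a Pandas function can list the columns that contain NaN. Using the following code, I found that I had missed three columns. After handling these new NaN values, the code ran properly.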
pd.isnull(train_data).sum() > 0
Result
portfolio_id False
desk_id False
office_id False
pf_category False
start_date False
sold True
country_code False
euribor_rate False
currency False
libor_rate True
bought True
creation_date False
indicator_code False
sell_date False
type False
hedge_value False
status False
return False
dtype: bool
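The same check can be narrowed to just the offending columns and their missing-value counts. A minimal sketch on a small made-up frame (the values are invented for illustration; only the column names come from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_data; the values are made up.
train_data = pd.DataFrame({
    "sold": [1.0, np.nan, 3.0],
    "libor_rate": [np.nan, np.nan, 0.5],
    "currency": ["USD", "EUR", "USD"],
})

# Count NaNs per column, then keep only the columns that have any.
nan_counts = train_data.isnull().sum()
nan_columns = nan_counts[nan_counts > 0]
print(nan_columns)
```

Printing the counts rather than booleans also shows how much data each column is missing, which can help when choosing a fill strategy.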
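Answer 1 (score: 1)
The error is in the other features, which you are treating as non-categorical.
Columns such as 'indicator_code' contain mixed-type data: TRUE and FALSE from the original CSV, and 2.0 from your fillna() call. OneHotEncoder cannot handle them.
As stated in the OneHotEncoder fit() documentation: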
fit(X, y=None)
Fit OneHotEncoder to X.
Parameters:
X : array-like, shape [n_samples, n_feature]
Input array of type int.
You can see that it requires all of X to be of a numeric type (int, though float will also do).
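To see which columns mix types before encoding, one option is to count the distinct Python types in each column. A small sketch with invented values that mimics a boolean column whose gaps were filled with 2.0; the helper name mixed_type_columns is hypothetical:

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns holding more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Made-up data: booleans plus a float filler in one column.
df = pd.DataFrame({
    "indicator_code": pd.Series([True, False, 2.0, True], dtype=object),
    "euribor_rate": [0.1, 0.2, 0.3, 0.4],
})
print(mixed_type_columns(df))
```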
As a workaround, you can do this to encode your categorical features:
X_train_categorical = x_train[:, [0,1,2,3,6,8,14]]
onehotencoder = OneHotEncoder()
X_train_categorical = onehotencoder.fit_transform(X_train_categorical).toarray()
Then combine the result with your non-categorical features.
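One way to do that combination is to stack the encoded block next to the remaining columns with NumPy. A minimal sketch on made-up data, assuming the categorical columns have already been label-encoded to numbers as in the question; the array contents and column positions are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix: columns 0 and 1 are label-encoded categories,
# column 2 is an ordinary numeric feature. Values are made up.
x_train = np.array([
    [0, 1, 0.5],
    [1, 0, 1.5],
    [2, 1, 2.5],
])
categorical_idx = [0, 1]
numeric_idx = [2]

# One-hot encode only the categorical columns.
encoder = OneHotEncoder()
x_cat = encoder.fit_transform(x_train[:, categorical_idx]).toarray()

# Put the untouched numeric columns back alongside the encoded ones.
x_combined = np.hstack([x_cat, x_train[:, numeric_idx]])
print(x_combined.shape)
```

In scikit-learn 0.20 and later, ColumnTransformer from sklearn.compose can perform the split and recombination in a single object, which may be easier to keep consistent between training and test data.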
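Answer 2 (score: 0)
To use this in production, the best practice is to use an Imputer and then save it in a pkl file together with the model.
A quick but crude workaround is: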
df[df==np.inf]=np.nan
df.fillna(df.mean(), inplace=True)
Better to use this.
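A rough sketch of the imputer-plus-pickle idea. It assumes a scikit-learn version that provides SimpleImputer in sklearn.impute (the older Imputer class was removed in later releases); the data are made up:

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up frame with one infinity and one NaN.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [np.nan, 2.0, 4.0]})

# Treat infinities as missing so the imputer can fill them too.
df = df.replace([np.inf, -np.inf], np.nan)

# Learn the column means and fill the gaps in the training data.
imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(df)

# Serialize the fitted imputer; in practice it would be stored with the model.
blob = pickle.dumps(imputer)
restored = pickle.loads(blob)

# The restored imputer fills new rows with the means learned at fit time.
new_rows = pd.DataFrame({"a": [np.nan], "b": [np.nan]})
print(restored.transform(new_rows))
```

Saving the fitted imputer means new data is filled with the training-set means rather than with means recomputed from whatever arrives at prediction time.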