Sklearn: feature hashing produces NaN values on a pandas DataFrame

Time: 2019-06-01 16:58:26

Tags: python sklearn-pandas

I am using FeatureHasher from sklearn.feature_extraction on a pandas DataFrame.

I want to generate hashed features from text features.

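For three of the original features this works fine and the new columns contain values. For one feature, all of the new columns contain only NaN values.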
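For reference, here is a minimal sketch of hashing a single text column. The toy DataFrame and its values are hypothetical stand-ins for the real data. It assumes each value should be hashed as one token, so every value is wrapped in a one-element list. With input_type='string', FeatureHasher expects each sample to be an iterable of strings, and a bare string is either split into characters or rejected, depending on the scikit-learn version.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical stand-in for one text column of the real dataset.
df = pd.DataFrame({"app_id": ["a1", "b2", "a1", "c3"]})

hasher = FeatureHasher(n_features=16, input_type="string")

# Wrap every value in a one-element list so the whole value is hashed
# as a single token rather than character by character.
samples = [[value] for value in df["app_id"].astype(str)]
hashed = hasher.fit_transform(samples).toarray()

# Reuse the original index so that a later pd.concat(axis=1) lines up
# row by row instead of aligning on mismatched index labels.
hashed_df = pd.DataFrame(
    hashed,
    columns=[f"app{i}" for i in range(16)],
    index=df.index,
)

print(hashed_df.shape)
print(int(hashed_df.isna().sum().sum()))
```

Passing index=df.index is a precaution: if the source frame does not have a default 0..n-1 index, concatenating along axis=1 aligns on index labels and can introduce NaN values by itself.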

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

X = data.iloc[:,0:13]  
y = data.iloc[:,-1] 

h1 = FeatureHasher(n_features=16,input_type='string')
h2 = FeatureHasher(n_features=16,input_type='string')
h3 = FeatureHasher(n_features=32,input_type='string')
h4 = FeatureHasher(n_features=32,input_type='string')

f1 = h1.fit_transform(X['user_state'])
f2 = h2.fit_transform(X['app_id'])
f3 = h3.fit_transform(X['user_isp'])
f4 = h4.fit_transform(X['device_model'])

a1=f1.toarray()
a2=f2.toarray()
a3=f3.toarray()
a4=f4.toarray()

state_col_names=['state'+str(i) for i in range(16)]
app_col_names=['app'+str(i) for i in range(16)]
isp_col_names=['isp'+str(i) for i in range(32)]
device_model_names=['dev_mod'+str(i) for i in range(32)]
data_proc=pd.concat([X,pd.DataFrame(a1,columns=state_col_names),pd.DataFrame(a2,columns=app_col_names)/
                     pd.DataFrame(a3,columns=isp_col_names),pd.DataFrame(a4,columns=device_model_names),y],axis=1)

data_proc=data_proc.drop(["user_state","user_isp","device_maker","device_model","device_osv","device_height",\
                    "device_width","device_area","marketplace","app_category_primary"],axis=1)

For ['user_state'], ['app_id'], and ['device_model'] I get valid (numeric) output. For ['app_id'] I get only NaN values in the new columns.
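One thing that may be worth checking is the `/` at the end of the first line of the pd.concat call. Inside the square brackets Python reads it as a division of the app DataFrame by the isp DataFrame, not as a line continuation. The sketch below uses small hypothetical frames to show what pandas does in that case. It is only a guess at the cause and may not match the exact symptom described.

```python
import pandas as pd

# Hypothetical small frames standing in for the hashed app and isp columns.
app = pd.DataFrame([[1.0, 0.0], [0.0, -1.0]], columns=["app0", "app1"])
isp = pd.DataFrame([[0.0, 1.0], [1.0, 0.0]], columns=["isp0", "isp1"])

# `/` divides the two frames. pandas aligns them on column labels first,
# and because no label is shared, every cell of the result is NaN.
divided = app / isp
print(divided.columns.tolist())
print(bool(divided.isna().all().all()))

# Separating the frames with a comma keeps both sets of values intact.
combined = pd.concat([app, isp], axis=1)
print(bool(combined.isna().any().any()))
```

If this is what happens in the code above, replacing the `/` with a comma would pass the two frames to pd.concat as separate items.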

Does anyone have an idea?

0 answers:

No answers yet.