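I used FeatureHasher from the sklearn.feature_extraction package on a Pandas DataFrame.
I want to generate hashed features from text features.
For 3 of the original features it works fine and the new columns contain values. For one feature, all of the new columns contain only NaN values.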
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
# `data` is a DataFrame loaded earlier (not shown here)
X = data.iloc[:, 0:13]
y = data.iloc[:, -1]
# One hasher per text column
h1 = FeatureHasher(n_features=16, input_type='string')
h2 = FeatureHasher(n_features=16, input_type='string')
h3 = FeatureHasher(n_features=32, input_type='string')
h4 = FeatureHasher(n_features=32, input_type='string')
f1 = h1.fit_transform(X['user_state'])
f2 = h2.fit_transform(X['app_id'])
f3 = h3.fit_transform(X['user_isp'])
f4 = h4.fit_transform(X['device_model'])
# Convert the sparse results to dense arrays
a1 = f1.toarray()
a2 = f2.toarray()
a3 = f3.toarray()
a4 = f4.toarray()
# Column names for the hashed features
state_col_names = ['state' + str(i) for i in range(16)]
app_col_names = ['app' + str(i) for i in range(16)]
isp_col_names = ['isp' + str(i) for i in range(32)]
device_model_names = ['dev_mod' + str(i) for i in range(32)]
# Combine the original features, the hashed columns, and the target
data_proc = pd.concat([X, pd.DataFrame(a1, columns=state_col_names), pd.DataFrame(a2, columns=app_col_names)/
                       pd.DataFrame(a3, columns=isp_col_names), pd.DataFrame(a4, columns=device_model_names), y], axis=1)
# Drop the original columns that are no longer needed
data_proc = data_proc.drop(["user_state", "user_isp", "device_maker", "device_model", "device_osv", "device_height",
                            "device_width", "device_area", "marketplace", "app_category_primary"], axis=1)
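As a side note on the hashing step itself: with input_type='string', FeatureHasher expects each sample to be an iterable of strings, and a bare string is iterated character by character (newer scikit-learn releases, as far as I know, reject bare strings with an error). The following is only a minimal sketch, assuming the goal is one hashed token per cell; the sample values, the non-default index, and the names `samples`, `hashed_df`, and `result` are made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Hypothetical data standing in for one text column, with a non-default index
X = pd.DataFrame({'user_state': ['CA', 'NY', 'CA', 'TX']}, index=[10, 11, 12, 13])

hasher = FeatureHasher(n_features=16, input_type='string')

# Wrap each value in a list so the whole string is hashed as a single token
samples = [[value] for value in X['user_state'].astype(str)]
hashed = hasher.fit_transform(samples).toarray()

# Reuse X's index so that pd.concat(axis=1) lines the rows up
state_cols = ['state' + str(i) for i in range(16)]
hashed_df = pd.DataFrame(hashed, columns=state_cols, index=X.index)
result = pd.concat([X, hashed_df], axis=1)

print(result.shape)
```

Passing index=X.index matters only when X does not have the default 0..n-1 index; pd.concat with axis=1 aligns rows by index label, so mismatched labels would show up as NaN.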
For ['user_state'], ['app_id'], and ['device_model'], I get valid output (numbers).
For ['app_id'], I get only NaN values in the new columns.
Does anyone know why?
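One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the first line of the pd.concat call ends in `/` instead of `,` or `\`. Inside a list literal Python reads that as division, and pandas aligns arithmetic on column labels, so dividing two frames that share no column names produces a frame full of NaN. A minimal sketch with made-up arrays (the names `divided` and `listed` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for two hashed arrays (values are made up)
a2 = np.ones((3, 2))
a3 = np.ones((3, 4))
app_cols = ['app' + str(i) for i in range(2)]
isp_cols = ['isp' + str(i) for i in range(4)]

# A trailing "/" inside the list divides the two frames instead of listing them
divided = pd.concat([pd.DataFrame(a2, columns=app_cols) /
                     pd.DataFrame(a3, columns=isp_cols)], axis=1)

# With "," the two frames are placed side by side as intended
listed = pd.concat([pd.DataFrame(a2, columns=app_cols),
                    pd.DataFrame(a3, columns=isp_cols)], axis=1)

print(divided.isna().all().all())  # True: every cell is NaN
print(listed.isna().any().any())   # False: no NaN introduced
```

If this is what happens here, both the app* and the isp* columns would come out as NaN, so it may be worth inspecting the isp* columns as well.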