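I'm building a neural network and plan to use OneHotEncoder on many independent (categorical) variables. I'd like to know whether I'm creating the dummy variables correctly, or whether there is a better approach given that all of my variables need dummy variables.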
df
UserName Token ThreadID ChildEXE
0 TAG TokenElevationTypeDefault (1) 20788 splunk-MonitorNoHandle.exe
1 TAG TokenElevationTypeDefault (1) 19088 splunk-optimize.exe
2 TAG TokenElevationTypeDefault (1) 2840 net.exe
807 User TokenElevationTypeFull (2) 18740 E2CheckFileSync.exe
808 User TokenElevationTypeFull (2) 18740 E2check.exe
809 User TokenElevationTypeFull (2) 18740 E2check.exe
811 Local TokenElevationTypeFull (2) 18740 sc.exe
ParentEXE ChildFilePath ParentFilePath
splunkd.exe C:\Program Files\Splunk\bin C:\Program Files\Splunk\bin 0
splunkd.exe C:\Program Files\Splunk\bin C:\Program Files\Splunk\bin 0
dagent.exe C:\Windows\System32 C:\Program Files\Dagent 0
wscript.exe \Device\Mup\sysvol C:\Windows 1
E2CheckFileSync.exe C:\Util \Device\Mup\sysvol\ 1
cmd.exe C:\Windows\SysWOW64 C:\Util\E2Check 1
cmd.exe C:\Windows C:\Windows\SysWOW64 1
DependentVariable
0
0
0
1
1
1
1
I import the data and use LabelEncoder on the independent variables:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
#IMPORT DATA
#Matrix x of features
X = df.iloc[:, 0:7].values
#Dependent variable
y = df.iloc[:, 7].values
#Encoding Independent Variable
#Need a label encoder for every categorical variable
#Converts categorical into number - set correct index of column
#Encode "UserName"
labelencoder_X_1 = LabelEncoder()
X[:, 0] = labelencoder_X_1.fit_transform(X[:, 0])
#Encode "Token"
labelencoder_X_2 = LabelEncoder()
X[:, 1] = labelencoder_X_2.fit_transform(X[:, 1])
#Encode "ChildEXE"
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
#Encode "ParentEXE"
labelencoder_X_4 = LabelEncoder()
X[:, 4] = labelencoder_X_4.fit_transform(X[:, 4])
#Encode "ChildFilePath"
labelencoder_X_5 = LabelEncoder()
X[:, 5] = labelencoder_X_5.fit_transform(X[:, 5])
#Encode "ParentFilePath"
labelencoder_X_6 = LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])
This gives me the following array:
X
array([[2, 0, 20788, ..., 46, 31, 24],
[2, 0, 19088, ..., 46, 31, 24],
[2, 0, 2840, ..., 27, 42, 15],
...,
[2, 0, 20148, ..., 17, 40, 32],
[2, 0, 20148, ..., 47, 23, 0],
[2, 0, 3176, ..., 48, 42, 32]], dtype=object)
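The six near-identical LabelEncoder blocks above can probably be collapsed into one step. The following is a minimal sketch, not the asker's code: it assumes a small made-up frame called `toy` (its values are illustrative only) and uses scikit-learn's OrdinalEncoder, which integer-encodes several feature columns in a single call.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for df: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "net.exe", "sc.exe"],
})

# Only these columns are encoded; ThreadID is left untouched.
categorical_cols = ["UserName", "ChildEXE"]

# One encoder covers every categorical column. Each column gets its own
# sorted category list, stored in ordinal.categories_.
ordinal = OrdinalEncoder()
encoded = toy.copy()
encoded[categorical_cols] = ordinal.fit_transform(toy[categorical_cols])
```

The fitted `ordinal` object keeps the category-to-integer mapping for all columns, so the same mapping can be reused on new data with `ordinal.transform`.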
Now I have to create dummy variables for all of the independent variables.
Should I use:
onehotencoder = OneHotEncoder(categorical_features = [0, 1, 2, 3, 4, 5, 6])
X = onehotencoder.fit_transform(X).toarray()
which gives me:
X
array([[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 1., 0., 0.]])
Or is there a better way to approach this?
Answer 0 (score: 1)
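You can also try pd.get_dummies. It expects a DataFrame when columns is given, so wrap the array first: X = pd.get_dummies(pd.DataFrame(X), columns=[0, 1, 2, 3, 4, 5, 6], drop_first=True)
'drop_first=True' saves you from the dummy variable trap: it drops one level per variable, so the remaining dummy columns for that variable are no longer perfectly collinear.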
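To make the effect of drop_first concrete, here is a small sketch on a made-up frame (the `toy` data and the shortened Token values are assumptions for illustration, not the asker's data). It builds the dummies with and without drop_first so the column sets can be compared.

```python
import pandas as pd

# Hypothetical two-column frame with 3 and 2 distinct levels respectively.
toy = pd.DataFrame({
    "UserName": ["TAG", "User", "Local", "TAG"],
    "Token": ["Default", "Full", "Full", "Default"],
})

# Without drop_first: one column per level (3 + 2 = 5 columns).
full = pd.get_dummies(toy, columns=["UserName", "Token"])

# With drop_first: the first (sorted) level of each variable is dropped,
# leaving (3 - 1) + (2 - 1) = 3 columns.
reduced = pd.get_dummies(toy, columns=["UserName", "Token"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row where every remaining dummy of a variable is zero then stands for the dropped level of that variable.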
Answer 1 (score: 0)
This is the best approach I could find that works:
onehotencoder = OneHotEncoder(categorical_features = [0,1,2,3,4,5,6])
X = onehotencoder.fit_transform(X).toarray()
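Note that the categorical_features parameter was deprecated in scikit-learn 0.20 and removed in 0.22, so the call above fails on current releases. A rough equivalent, sketched here on a made-up `toy` frame (the data and the choice to pass ThreadID through unencoded are assumptions, not part of the original answer), uses ColumnTransformer to select the columns. OneHotEncoder accepts string columns directly in these versions, so the separate LabelEncoder step should not be needed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for df: two categorical columns and one numeric column.
toy = pd.DataFrame({
    "UserName": ["TAG", "TAG", "User", "Local"],
    "ThreadID": [20788, 19088, 18740, 18740],
    "ChildEXE": ["net.exe", "sc.exe", "net.exe", "sc.exe"],
})

categorical_cols = ["UserName", "ChildEXE"]

preprocess = ColumnTransformer(
    transformers=[
        # handle_unknown="ignore" encodes categories unseen during fit as all zeros.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # keep the non-listed columns (here ThreadID) as they are
    sparse_threshold=0.0,     # always return a dense array
)

# Columns come out as: 3 UserName dummies, 2 ChildEXE dummies, then ThreadID.
X_encoded = preprocess.fit_transform(toy)
```

With many distinct executables and paths, the dense output can get wide; leaving sparse_threshold at its default lets ColumnTransformer return a sparse matrix when the result is mostly zeros.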