array(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'], dtype=object)
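This is one feature (column) of my dataset. All values in the array are unique; there are 70 of them, and each one is treated as a category. I want to convert this feature to a one-hot encoding. That is, if a row contains 'ftp_data', it should be encoded as 1000000..., and so on. I know one approach: assign a numeric value to each word, replace the words in the dataset with those numbers, and then apply a one-hot encoding method. Is there another way to convert my dataset directly from words to a one-hot encoding? Can anyone suggest a way to do this in pandas?
Answer 0 (score: 1)
I think you are looking for pandas.get_dummies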
import pandas as pd

s = pd.Series(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'])
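# Build one 0/1 indicator column per category, transpose so that each
# category becomes a row, then join that row's flags into a single string.
one_hot = (
    pd.get_dummies(s, dtype=int)
    .T
    .apply(lambda row: ''.join(row.astype(str).tolist()), axis=1)
    .sort_values(ascending=False)
)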
print(one_hot)
ftp_data 1000000000000000000000000000000000000000000000...
other 0100000000000000000000000000000000000000000000...
private 0010000000000000000000000000000000000000000000...
http 0001000000000000000000000000000000000000000000...
remote_job 0000100000000000000000000000000000000000000000...
...
http_8001 0000000000000000000000000000000000000000000000...
aol 0000000000000000000000000000000000000000000000...
http_2784 0000000000000000000000000000000000000000000000...
tftp_u 0000000000000000000000000000000000000000000000...
harvest 0000000000000000000000000000000000000000000000...
Length: 70, dtype: object
print(one_hot.head(50))
ftp_data 1000000000000000000000000000000000000000000000...
other 0100000000000000000000000000000000000000000000...
private 0010000000000000000000000000000000000000000000...
http 0001000000000000000000000000000000000000000000...
remote_job 0000100000000000000000000000000000000000000000...
name 0000010000000000000000000000000000000000000000...
netbios_ns 0000001000000000000000000000000000000000000000...
eco_i 0000000100000000000000000000000000000000000000...
mtp 0000000010000000000000000000000000000000000000...
telnet 0000000001000000000000000000000000000000000000...
finger 0000000000100000000000000000000000000000000000...
domain_u 0000000000010000000000000000000000000000000000...
supdup 0000000000001000000000000000000000000000000000...
uucp_path 0000000000000100000000000000000000000000000000...
Z39_50 0000000000000010000000000000000000000000000000...
smtp 0000000000000001000000000000000000000000000000...
csnet_ns 0000000000000000100000000000000000000000000000...
uucp 0000000000000000010000000000000000000000000000...
netbios_dgm 0000000000000000001000000000000000000000000000...
urp_i 0000000000000000000100000000000000000000000000...
auth 0000000000000000000010000000000000000000000000...
domain 0000000000000000000001000000000000000000000000...
ftp 0000000000000000000000100000000000000000000000...
bgp 0000000000000000000000010000000000000000000000...
ldap 0000000000000000000000001000000000000000000000...
ecr_i 0000000000000000000000000100000000000000000000...
gopher 0000000000000000000000000010000000000000000000...
vmnet 0000000000000000000000000001000000000000000000...
systat 0000000000000000000000000000100000000000000000...
http_443 0000000000000000000000000000010000000000000000...
efs 0000000000000000000000000000001000000000000000...
whois 0000000000000000000000000000000100000000000000...
imap4 0000000000000000000000000000000010000000000000...
iso_tsap 0000000000000000000000000000000001000000000000...
echo 0000000000000000000000000000000000100000000000...
klogin 0000000000000000000000000000000000010000000000...
link 0000000000000000000000000000000000001000000000...
sunrpc 0000000000000000000000000000000000000100000000...
login 0000000000000000000000000000000000000010000000...
kshell 0000000000000000000000000000000000000001000000...
sql_net 0000000000000000000000000000000000000000100000...
time 0000000000000000000000000000000000000000010000...
hostnames 0000000000000000000000000000000000000000001000...
exec 0000000000000000000000000000000000000000000100...
ntp_u 0000000000000000000000000000000000000000000010...
discard 0000000000000000000000000000000000000000000001...
nntp 0000000000000000000000000000000000000000000000...
courier 0000000000000000000000000000000000000000000000...
ctf 0000000000000000000000000000000000000000000000...
ssh 0000000000000000000000000000000000000000000000...
dtype: object
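The joined strings above are mostly useful for display. If the goal is to encode a column of a real dataset, get_dummies can be applied to the DataFrame directly, giving one 0/1 column per category. The following is only a minimal sketch: the column name "service", the sample rows, and the shortened category list are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical data; "service" is an assumed column name.
df = pd.DataFrame({"service": ["http", "ftp_data", "smtp", "http"]})

# Declare the categories explicitly so the column order is fixed and
# categories absent from this sample still get an (all-zero) column.
categories = ["ftp_data", "http", "smtp", "private"]
df["service"] = pd.Categorical(df["service"], categories=categories)

# One indicator column per category, named "service_<category>".
encoded = pd.get_dummies(df, columns=["service"], dtype=int)
print(encoded)
```

Passing the full list of 70 names as categories would keep the encoding consistent between, for example, a training and a test split.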
To convert to float:
print(one_hot.astype(float))
ftp_data 1.000000e+69
other 1.000000e+68
private 1.000000e+67
http 1.000000e+66
remote_job 1.000000e+65
...
http_8001 1.000000e+04
aol 1.000000e+03
http_2784 1.000000e+02
tftp_u 1.000000e+01
harvest 1.000000e+00
Length: 70, dtype: float64
Note that astype(int) raises an error, most likely because a 70-digit value does not fit in a 64-bit integer.
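If exact integer values are really needed, one possible workaround for the astype(int) failure is to map each string through Python's built-in int, which has arbitrary precision. This is a sketch under that assumption, using two hand-built 70-character codes rather than the full series; pandas is expected to fall back to object dtype for the result, since the values do not fit in int64.

```python
import pandas as pd

# Two 70-character one-hot strings, built by hand for illustration.
codes = pd.Series({
    "ftp_data": "1" + "0" * 69,
    "harvest": "0" * 69 + "1",
})

# Python ints are arbitrary precision, so the values stay exact,
# unlike the float conversion, which rounds large values.
as_int = codes.map(int)
print(as_int)
```

For most modelling purposes the separate 0/1 columns from get_dummies are easier to work with than a single large number per row.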