My dataset contains text, and I want to convert it to one-hot encoding

Asked: 2019-09-26 16:13:26

Tags: python pandas machine-learning data-mining

array(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
   'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
   'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
   'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
   'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
   'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
   'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
   'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
   'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
   'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
   'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'], dtype=object)

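This is one feature in my dataset. All values in the array are unique, and there are 70 of them. Each unique value is treated as a category. I want to convert this feature to one-hot encoding: if a row contains 'ftp_data', it should be encoded as 1000000..., and so on.

I know one way to do it: assign a numeric value to each word, replace the words in the dataset with those values, and then use a one_hot_encoding method. Is there a way to convert my dataset from words directly to one-hot encoding? Can someone suggest how to do this in pandas?

1 answer:

Answer 0: (score: 1)

I think you are looking for pandas.get_dummies: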

import pandas as pd

s = pd.Series(['ftp_data', 'other', 'private', 'http', 'remote_job', 'name',
   'netbios_ns', 'eco_i', 'mtp', 'telnet', 'finger', 'domain_u',
   'supdup', 'uucp_path', 'Z39_50', 'smtp', 'csnet_ns', 'uucp',
   'netbios_dgm', 'urp_i', 'auth', 'domain', 'ftp', 'bgp', 'ldap',
   'ecr_i', 'gopher', 'vmnet', 'systat', 'http_443', 'efs', 'whois',
   'imap4', 'iso_tsap', 'echo', 'klogin', 'link', 'sunrpc', 'login',
   'kshell', 'sql_net', 'time', 'hostnames', 'exec', 'ntp_u',
   'discard', 'nntp', 'courier', 'ctf', 'ssh', 'daytime', 'shell',
   'netstat', 'pop_3', 'nnsp', 'IRC', 'pop_2', 'printer', 'tim_i',
   'pm_dump', 'red_i', 'netbios_ssn', 'rje', 'X11', 'urh_i',
   'http_8001', 'aol', 'http_2784', 'tftp_u', 'harvest'])
# Build the indicator matrix, transpose it so each category is a row,
# join each row's digits into one string, then sort descending so the
# categories come back in their original order.
one_hot = pd.get_dummies(s, dtype=int).T.apply(lambda x: ''.join(x.astype(str).tolist()), axis=1).sort_values(ascending=False)
print(one_hot)



ftp_data      1000000000000000000000000000000000000000000000...
other         0100000000000000000000000000000000000000000000...
private       0010000000000000000000000000000000000000000000...
http          0001000000000000000000000000000000000000000000...
remote_job    0000100000000000000000000000000000000000000000...
                                    ...                        
http_8001     0000000000000000000000000000000000000000000000...
aol           0000000000000000000000000000000000000000000000...
http_2784     0000000000000000000000000000000000000000000000...
tftp_u        0000000000000000000000000000000000000000000000...
harvest       0000000000000000000000000000000000000000000000...
Length: 70, dtype: object

print(one_hot.head(50))

ftp_data       1000000000000000000000000000000000000000000000...
other          0100000000000000000000000000000000000000000000...
private        0010000000000000000000000000000000000000000000...
http           0001000000000000000000000000000000000000000000...
remote_job     0000100000000000000000000000000000000000000000...
name           0000010000000000000000000000000000000000000000...
netbios_ns     0000001000000000000000000000000000000000000000...
eco_i          0000000100000000000000000000000000000000000000...
mtp            0000000010000000000000000000000000000000000000...
telnet         0000000001000000000000000000000000000000000000...
finger         0000000000100000000000000000000000000000000000...
domain_u       0000000000010000000000000000000000000000000000...
supdup         0000000000001000000000000000000000000000000000...
uucp_path      0000000000000100000000000000000000000000000000...
Z39_50         0000000000000010000000000000000000000000000000...
smtp           0000000000000001000000000000000000000000000000...
csnet_ns       0000000000000000100000000000000000000000000000...
uucp           0000000000000000010000000000000000000000000000...
netbios_dgm    0000000000000000001000000000000000000000000000...
urp_i          0000000000000000000100000000000000000000000000...
auth           0000000000000000000010000000000000000000000000...
domain         0000000000000000000001000000000000000000000000...
ftp            0000000000000000000000100000000000000000000000...
bgp            0000000000000000000000010000000000000000000000...
ldap           0000000000000000000000001000000000000000000000...
ecr_i          0000000000000000000000000100000000000000000000...
gopher         0000000000000000000000000010000000000000000000...
vmnet          0000000000000000000000000001000000000000000000...
systat         0000000000000000000000000000100000000000000000...
http_443       0000000000000000000000000000010000000000000000...
efs            0000000000000000000000000000001000000000000000...
whois          0000000000000000000000000000000100000000000000...
imap4          0000000000000000000000000000000010000000000000...
iso_tsap       0000000000000000000000000000000001000000000000...
echo           0000000000000000000000000000000000100000000000...
klogin         0000000000000000000000000000000000010000000000...
link           0000000000000000000000000000000000001000000000...
sunrpc         0000000000000000000000000000000000000100000000...
login          0000000000000000000000000000000000000010000000...
kshell         0000000000000000000000000000000000000001000000...
sql_net        0000000000000000000000000000000000000000100000...
time           0000000000000000000000000000000000000000010000...
hostnames      0000000000000000000000000000000000000000001000...
exec           0000000000000000000000000000000000000000000100...
ntp_u          0000000000000000000000000000000000000000000010...
discard        0000000000000000000000000000000000000000000001...
nntp           0000000000000000000000000000000000000000000000...
courier        0000000000000000000000000000000000000000000000...
ctf            0000000000000000000000000000000000000000000000...
ssh            0000000000000000000000000000000000000000000000...
dtype: object
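
If the goal is to encode the rows of an actual dataset rather than to print one string per category, get_dummies can also be applied to a DataFrame column directly. The following is only a minimal sketch: the column name service and the four-row sample frame are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical sample frame; "service" is an assumed column name.
df = pd.DataFrame({"service": ["ftp_data", "http", "private", "http"]})

# One indicator column per category, one row per original row.
encoded = pd.get_dummies(df["service"], dtype=int)
print(encoded)

# Optional: collapse each row's indicators into a compact string.
codes = encoded.astype(str).apply(lambda row: "".join(row), axis=1)
print(codes.tolist())
```

Note that get_dummies orders the indicator columns alphabetically by category, not by order of appearance, so the position of the 1 for a given category depends on the full set of categories present.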

Converted to float:

print(one_hot.astype(float))

ftp_data      1.000000e+69
other         1.000000e+68
private       1.000000e+67
http          1.000000e+66
remote_job    1.000000e+65
                  ...     
http_8001     1.000000e+04
aol           1.000000e+03
http_2784     1.000000e+02
tftp_u        1.000000e+01
harvest       1.000000e+00
Length: 70, dtype: float64

Note that astype(int) raises an error.
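
A likely reason for the integer failure is that a 70-digit value is far larger than a 64-bit integer can hold. The sketch below checks that assumption on a shortened three-category series (the sample values are taken from the question, the variable names big and matrix are illustrative) and shows one alternative: keeping the indicators as a 2-D integer array instead of packing them into a single number.

```python
import numpy as np
import pandas as pd

# Shortened example series; the real feature has 70 categories.
s = pd.Series(["ftp_data", "other", "private"])
one_hot = pd.get_dummies(s, dtype=int)

# "1" followed by 69 zeros, read as a Python int, exceeds the int64 range.
big = int("1" + "0" * 69)
print(big > np.iinfo(np.int64).max)

# Keeping the indicators as a 2-D integer array sidesteps the overflow.
matrix = one_hot.to_numpy()
print(matrix.shape, matrix.dtype)
```

Each row of matrix contains exactly one 1, which is the form most machine-learning libraries expect for one-hot features.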