Question

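I know that feature hashing (the hashing trick) is used to reduce dimensionality and to handle the sparsity of bit vectors, but I don't understand how it actually works. Can anyone explain it to me? Is there a Python library available for feature hashing?

Thank you.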

Answer 1

In pandas, you could use something like this:

import pandas as pd
import numpy as np

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

data = pd.DataFrame(data)

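def hash_col(df, col, N):
    # One indicator column per hash bucket: col_0 ... col_{N-1}
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        # Note: in Python 3, hash() of a str is salted per process unless
        # PYTHONHASHSEED is set, so bucket assignments can differ between runs.
        tmp = [0 for i in range(N)]
        tmp[hash(x) % N] = 1
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))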

The output is

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        1        0        0
1  1.7  2001        0        1        0        0
2  3.6  2002        0        1        0        0
3  2.4  2001        0        0        0        1
4  2.9  2002        0        0        0        1
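
The snippet above relies on Python's built-in hash, whose result for strings can change from one process to the next in Python 3. A minimal sketch of the same bucketing idea with a deterministic hash follows. The helper names `stable_bucket` and `one_hot_hashed` are made up for illustration, and MD5 is just one convenient stable hash, not something the answer prescribes.

```python
import hashlib

def stable_bucket(value, n_buckets):
    """Map a value to a bucket index in [0, n_buckets).

    MD5 is used only because it gives the same digest in every process;
    any stable hash function would do.
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def one_hot_hashed(value, n_buckets):
    """Return a list of n_buckets zeros with a single 1 at the hashed index."""
    row = [0] * n_buckets
    row[stable_bucket(value, n_buckets)] = 1
    return row

states = ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada']
rows = [one_hot_hashed(s, 4) for s in states]
for s, r in zip(states, rows):
    print(s, r)
```

Equal inputs always land in the same bucket, but two different inputs may collide, especially with a small number of buckets such as 4.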

Similarly, at the Series level, you could do

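import numpy as np
import pandas as pd

def hash_col(df, col, N):
    # df is a single record given as a pd.Series; its index holds the column names
    df = df.replace('', np.nan)
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0 for i in range(N)]
    # .ix was removed in pandas 1.0; plain label indexing does the same here
    tmp[hash(df[col]) % N] = 1
    # Series.append was removed in pandas 2.0; pd.concat replaces it
    res = pd.concat([df, pd.Series(tmp, index=cols)])
    return res.drop(col)

a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])

print(hash_col(a, 'city', 10))
print(hash_col(b, 'city', 10))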

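This works one Series at a time, with the column names taken from the pandas index. It also replaces empty strings with NaN, and it floats everything.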

age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      0
city_6      0
city_7      1
city_8      0
city_9      0
dtype: object
age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      1
city_6      0
city_7      0
city_8      0
city_9      0
dtype: object

However, if you have a vocabulary and simply want to one-hot encode, you could use

import numpy as np
import pandas as pd, os
import scipy.sparse as sps

def hash_col(df, col, vocab):
    # One indicator column per vocabulary entry, e.g. "state=Ohio"
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x):
        tmp = [0 for i in range(len(vocab))]
        tmp[vocab.index(x)] = 1  # raises ValueError if x is not in vocab
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

df = pd.DataFrame(data)

df2 = hash_col(df, 'state', ['Ohio','Nevada'])

print(sps.csr_matrix(df2.values))

Before the conversion to a sparse matrix, df2 looks like this:

   pop  year  state=Ohio  state=Nevada
0  1.5  2000           1             0
1  1.7  2001           1             0
2  3.6  2002           1             0
3  2.4  2001           0             1
4  2.9  2002           0             1

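I also added sparsification of the final DataFrame. The approach above can be used in an incremental setting, where we may not have encountered every value beforehand but have somehow obtained the list of all possible values. Incremental ML methods need the same number of features at every increment, so the one-hot encoding must produce the same number of columns for each batch.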
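
When the list of possible values is not available, hashing gives the same guarantee of a fixed feature count per batch. The sketch below is only an illustration under assumptions: `N_FEATURES`, `hash_row`, and the `'Texas'` record are invented for the example, and MD5 stands in for whatever stable hash a real pipeline would use. A library implementation such as scikit-learn's `FeatureHasher` would be the more usual choice in practice.

```python
import hashlib

N_FEATURES = 8  # hypothetical fixed width shared by every batch

def hash_row(record, n_features=N_FEATURES):
    """Hash a dict of categorical fields into a fixed-length count vector."""
    vec = [0] * n_features
    for field, value in record.items():
        token = "{}={}".format(field, value)  # e.g. "state=Ohio"
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

batch1 = [{'state': 'Ohio'}, {'state': 'Nevada'}]
batch2 = [{'state': 'Texas'}]  # a value that batch1 never contained

X1 = [hash_row(r) for r in batch1]
X2 = [hash_row(r) for r in batch2]
print(len(X1[0]), len(X2[0]))  # both rows have N_FEATURES entries
```

No vocabulary is stored, so an unseen value needs no special handling. The price is that distinct values can share a column.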

Answer 2

Here (sorry, for some reason I cannot add this as a comment). Also, the first page of Feature Hashing for Large Scale Multitask Learning explains it nicely.

Answer 3

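Large sparse feature sets can arise from interactions: with U as the users and X as the emails, the dimension of U x X is memory intensive. Tasks such as spam filtering usually have time limits as well.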
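
As a rough sketch of how hashing keeps such interaction features bounded, each (user, word) pair can be hashed into the same fixed-size vector as the plain words. The names `bucket` and `hash_email`, the `"_"` separator, and the bucket count of 16 are assumptions made for this example, not details taken from the answer.

```python
import hashlib

def bucket(token, n_buckets):
    """Stable bucket index for a string token (MD5 chosen only for determinism)."""
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % n_buckets

def hash_email(user, words, n_buckets):
    """Count vector holding both global word features and user-by-word features."""
    vec = [0] * n_buckets
    for w in words:
        vec[bucket(w, n_buckets)] += 1               # word feature shared by all users
        vec[bucket(user + "_" + w, n_buckets)] += 1  # interaction feature (user x word)
    return vec

vec = hash_email("alice", ["cheap", "pills", "cheap"], 16)
print(vec)
```

The vector length stays at 16 no matter how many users or distinct words appear, whereas an explicit U x X table would grow with both.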

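Like other uses of hash functions, the hashing trick stores binary bits (indices), which makes large-scale training feasible. In theory, a longer hash length yields a larger performance gain, as illustrated in the original paper.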

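It allocates the original features to different buckets (a feature space of finite length) so that their semantics are preserved, even when spammers use typos to slip under the radar. Although there is some distortion error, the hashed form remains close to the original.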

For example,

"the quick brown fox" is transformed to:

h(the) mod 5 = 0

h(quick) mod 5 = 1

h(brown) mod 5 = 1

h(fox) mod 5 = 3

Using indices rather than the text values saves space.
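
To make the example concrete, here is a small sketch that turns those bucket indices into a count vector of length 5. The `toy_bucket` table simply copies the h(word) mod 5 values listed above and is not a real hash function. `hashed_counts` is a name invented for this illustration.

```python
# Bucket assignments copied from the example: h(word) mod 5
toy_bucket = {"the": 0, "quick": 1, "brown": 1, "fox": 3}

def hashed_counts(words, n_buckets, bucket_of):
    """Accumulate word counts at their bucket indices."""
    vec = [0] * n_buckets
    for w in words:
        vec[bucket_of(w)] += 1
    return vec

vec = hashed_counts("the quick brown fox".split(), 5, toy_bucket.__getitem__)
print(vec)  # [1, 2, 0, 1, 0]
```

"quick" and "brown" collide in bucket 1, so that slot holds a count of 2. This is the distortion that hashing trades for a fixed-size representation.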

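To summarize some of the applications:

- Dimensionality reduction for high-dimensional feature vectors
  - Text in email classification tasks, collaborative filtering on spam
- Sparsification
- On-the-fly vocabulary
- Cross-product features
- Multi-task learning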

References:

Original paper:
1. Feature Hashing
2. Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A., & Vishwanathan, V. (2009). Hash Kernels
What is the hashing trick
Quora
Gionis, A., Indyk, P., & Motwani, R. (1999). Similarity Search in High Dimensions via Hashing

Implementation:

Langford, J., Li, L., & Strehl, A. (2007). Vowpal Wabbit online learning project (Technical Report). http://hunch.net/?p=309.

What is feature hashing (the hashing trick)?
