Parallelizing sklearn FeatureHasher

Date: 2016-10-09 20:18:57

Tags: python machine-learning scikit-learn feature-extraction

Because it uses the hashing trick, sklearn's FeatureHasher feature extractor has several advantages over its DictVectorizer counterpart.
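To make the comparison concrete, here is a minimal sketch of the difference. The sample dictionaries and the small `n_features=16` are made up for illustration; they are not from the question.

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical toy data: each sample maps feature names to values.
samples = [{'a': 3, 'b': 2}, {'b': 1, 'c': 4}]

# DictVectorizer has to see the data first so it can learn a
# feature-name -> column vocabulary.
dv = DictVectorizer()
X_dict = dv.fit_transform(samples)

# FeatureHasher is stateless: the column of each feature comes from
# hashing its name, so transform() can be called without fitting.
fh = FeatureHasher(n_features=16)
X_hash = fh.transform(samples)

print(X_dict.shape, dv.vocabulary_)
print(X_hash.shape)
```

Because FeatureHasher keeps no vocabulary, its output width is fixed by `n_features` rather than by the data it happens to see.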

One advantage that seems harder to exploit is its ability to run in parallel.
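The reason this should be possible is that each sample is hashed independently of the others. A small sketch, using made-up data and a small `n_features` chosen only to keep the arrays tiny, shows that hashing two chunks separately and stacking them gives the same matrix as hashing everything at once:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(n_features=32)
# Hypothetical toy data for illustration.
samples = [{'a': 3, 'b': 2}, {'b': 1, 'c': 4}, {'a': 1}, {'c': 2, 'd': 5}]

# Hash everything in one call.
full = hasher.transform(samples)

# Hash two halves separately, then stack the rows back together.
first_half = hasher.transform(samples[:2])
second_half = hasher.transform(samples[2:])
stacked = sp.vstack([first_half, second_half]).tocsr()

# Compare as dense arrays (fine here because the matrices are tiny).
same = np.array_equal(full.toarray(), stacked.toarray())
print(same)
```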

My question is: how can I easily run FeatureHasher in parallel?

1 answer:

Answer 0: (score: 2)

You can implement a parallel version of FeatureHasher.transform using joblib (the library scikit-learn itself uses for parallel processing):

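# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly.
from joblib import Parallel, delayed
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction import FeatureHasher

def transform_parallel(self, X, n_jobs):
    # Split X into n_jobs chunks, hash each chunk in its own thread,
    # then stack the sparse results back together in the original order.
    transform_splits = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(self.transform)(X_split)
        for X_split in np.array_split(X, n_jobs))

    return sp.vstack(transform_splits)

# Attach the helper to FeatureHasher as a method.
FeatureHasher.transform_parallel = transform_parallel
f = FeatureHasher()
f.transform_parallel(np.array([{'a':3,'b':2}]*10), n_jobs=5)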

<10x1048576 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
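As a variation on the above, here is a sketch that avoids monkey-patching and works on plain Python lists. The helper name `hash_in_chunks` is hypothetical, and the sketch assumes `n_jobs` is a positive integer and that `samples` is non-empty. It also caps the number of chunks at the number of samples so that no chunk is empty.

```python
from joblib import Parallel, delayed
import scipy.sparse as sp
from sklearn.feature_extraction import FeatureHasher

def hash_in_chunks(hasher, samples, n_jobs=2):
    """Hash `samples` in up to `n_jobs` chunks and stack the results.

    Hypothetical helper; assumes n_jobs >= 1 and a non-empty input.
    """
    samples = list(samples)
    # Never create more chunks than there are samples.
    n_chunks = max(1, min(n_jobs, len(samples)))
    # Integer chunk boundaries, e.g. 10 samples in 3 chunks -> 0, 3, 6, 10.
    bounds = [i * len(samples) // n_chunks for i in range(n_chunks + 1)]
    chunks = [samples[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    # Parallel returns results in submission order, so row order is preserved.
    parts = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(hasher.transform)(chunk) for chunk in chunks)
    return sp.vstack(parts).tocsr()

hasher = FeatureHasher()
data = [{'a': 3, 'b': 2}] * 10
parallel_result = hash_in_chunks(hasher, data, n_jobs=5)
serial_result = hasher.transform(data)

# The difference between the two sparse matrices should be all zeros.
max_diff = abs(serial_result - parallel_result).sum()
print(parallel_result.shape, max_diff)
```

Whether threads actually speed this up depends on your data and environment, so it is worth timing against the plain `transform` call before adopting it.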