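I am trying to extend the scikit-learn class KNeighborsClassifier by introducing an alternative way of computing the distance between neighbors (see here if you are interested).
The parallelization scheme is as follows: suppose we want to compute the distances between all elements of set A and all elements of set B. For each element of A (sequentially, one after the other), the distances to all elements of B are computed in parallel. The time-consuming operation is computing the distance between any two elements, so each process should perform this basic operation.
The problem is that parallel execution (using Python's multiprocessing module) is much slower than serial execution, whether synchronous or asynchronous calls are used, and regardless of the machine and the number of cores.
I suspect this has to do with shared variables that are communicated in the background. The question is: which variables are being communicated, and how can this be avoided?
Code: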
    import multiprocessing as mp

    import numpy as np
    from joblib import Parallel, delayed
    from pyemd import emd  # assumed source of emd(); the original post does not show its imports
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.preprocessing import normalize
    from sklearn.utils import check_array


    class WordMoversKNN(KNeighborsClassifier):
        """K nearest neighbors classifier using the Word Mover's Distance.

        Parameters
        ----------
        W_embed : array, shape: (vocab_size, embed_size)
            Precomputed word embeddings between vocabulary items.
            Row indices should correspond to the columns in the bag-of-words input.
        n_neighbors : int
            Number of neighbors to use by default for :meth:`k_neighbors` queries.
        n_jobs : int
            The number of parallel jobs to run for Word Mover's Distance computation.
            If ``-1``, then the number of jobs is set to the number of CPU cores.
        verbose : int, optional
            Controls the verbosity; the higher, the more messages. Defaults to 5.
        """

        def __init__(self, W_embed, n_neighbors=1, n_jobs=1, verbose=5):
            self.W_embed = W_embed
            self.verbose = verbose
            if n_jobs == -1:
                n_jobs = mp.cpu_count()
            super(WordMoversKNN, self).__init__(
                n_neighbors=n_neighbors, n_jobs=n_jobs,
                metric='precomputed', algorithm='brute')

        def _wmd(self, i, row, X_train):
            """Compute the WMD between training sample i and given test row.

            Assumes that `row` and train samples are sparse BOW vectors summing to 1.
            """
            # Restrict the problem to the words that occur in either document.
            union_idx = np.union1d(X_train[i].indices, row.indices)
            W_minimal = self.W_embed[union_idx]
            W_dist = euclidean_distances(W_minimal)
            bow_i = X_train[i, union_idx].A.ravel()
            bow_j = row[:, union_idx].A.ravel()
            return emd(bow_i, bow_j, W_dist)

        def _wmd_row(self, row, X_train):
            """Wrapper to compute the WMD of a row with all training samples.

            Assumes that `row` and train samples are sparse BOW vectors summing to 1.
            Useful for parallelization.
            """
            n_samples_train = X_train.shape[0]
            return [self._wmd(i, row, X_train) for i in range(n_samples_train)]

        def _pairwise_wmd(self, X_test, X_train=None, ordered=True):
            """Computes the word mover's distance between all train and test points.

            Parallelized over rows of X_test.
            Assumes that train and test samples are sparse BOW vectors summing to 1.

            Parameters
            ----------
            X_test: scipy.sparse matrix, shape: (n_test_samples, vocab_size)
                Test samples.
            X_train: scipy.sparse matrix, shape: (n_train_samples, vocab_size)
                Training samples. If `None`, uses the samples the estimator was fit with.
            ordered: returns result keeping the order of the rows in dist (following X_test).
                Otherwise, the rows of dist follow a potentially random order which does not follow the order
                of indices in X_test. However, computation is faster in this case (asynchronous parallel execution)

            Returns
            -------
            dist : array, shape: (n_test_samples, n_train_samples)
                Distances between all test samples and all train samples.
            """
            n_samples_test = X_test.shape[0]
            if X_train is None:
                X_train = self._fit_X
            # Run serially to avoid parallelism overhead for small test samples.
            if (self.n_jobs == 1) or (n_samples_test < 2 * self.n_jobs):
                dist = [self._wmd_row(test_sample, X_train) for test_sample in X_test]
            else:
                if self.verbose:
                    print("WordMoversKNN set to use {} parallel processes".format(self.n_jobs))
                if ordered:
                    # One task per test row; each task receives X_train as an argument.
                    dist = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(
                        delayed(self._wmd_row)(test_sample, X_train)
                        for test_sample in X_test)
                else:  # Asynchronous call is faster but returns results in random order
                    pool = mp.Pool(processes=self.n_jobs)
                    results = [pool.apply_async(self._wmd_row, args=(test_sample, X_train))
                               for test_sample in X_test]
                    dist = [p.get() for p in results]
                    # Release the worker processes once all results are collected.
                    pool.close()
                    pool.join()
            return np.array(dist)

        def calculate(self, X):
            """Predict the class labels for the provided data

            Parameters
            ----------
            X : scipy.sparse matrix, shape (n_test_samples, vocab_size)
                Test samples.

            Returns
            -------
            y : array of shape [n_samples]
                Class labels for each data sample.
            """
            X = check_array(X, accept_sparse='csr', copy=True)
            X = normalize(X, norm='l1', copy=False)
            dist = self._pairwise_wmd(X)
            # A matrix of distances given to predict in combination with metric = 'precomputed'
            # means that no more distance calculations take place. Neighbors are found simply by sorting
            return super(WordMoversKNN, self).predict(dist)
Answer 0 (score: 0)
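The main problem is that a new process is spawned for every row of the matrix X_test, and each time the full X_train and other variables (e.g. self.W_embed) have to be passed to that process. Because of their size, pickling and dispatching these variables is very time-consuming.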
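A rough way to see the cost, as a sketch only: the snippet below uses small nested lists as stand-ins for the real matrices (the sizes are arbitrary assumptions) and counts the bytes that pickle produces when every test row is shipped together with all of the training data. It ignores the estimator itself, which is typically serialized too because `self._wmd_row` is a bound method.

```python
import pickle

# Stand-ins for the real matrices; the sizes are arbitrary assumptions.
X_train = [[float(i + j) for j in range(50)] for i in range(200)]
X_test = [[float(i * j) for j in range(50)] for i in range(40)]


def pickled_size(obj):
    """Number of bytes needed to serialize obj with pickle."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


# One task per test row: every task carries the full training data.
per_row_total = sum(pickled_size((row, X_train)) for row in X_test)

# The useful payload: the test rows themselves plus one copy of X_train.
minimum_total = pickled_size(X_test) + pickled_size(X_train)

print("per-row dispatch:", per_row_total, "bytes")
print("rows + one copy of X_train:", minimum_total, "bytes")
print("overhead factor: {:.1f}x".format(per_row_total / minimum_total))
```

The overhead grows with the number of test rows, since the training data is serialized once per task rather than once per worker.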
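I obtained a massive speedup when I split the X_test matrix into n_jobs chunks of size X_test.shape[0]//n_jobs, so that only n_jobs processes are spawned overall and the variables have to be communicated n_jobs times instead of X_test.shape[0] times.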
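The chunking idea can be sketched as follows. This is not the code from the question: `split_into_chunks`, `distances_for_chunk` and the toy `distance` function are hypothetical helpers, plain lists stand in for the sparse matrices, and the chunks are processed serially here so that the example stays self-contained. Unlike a plain `X_test.shape[0]//n_jobs` split, the helper spreads any remainder rows over the first chunks.

```python
def split_into_chunks(n_rows, n_jobs):
    """Return n_jobs (start, stop) ranges that together cover range(n_rows)."""
    base, extra = divmod(n_rows, n_jobs)
    bounds, start = [], 0
    for k in range(n_jobs):
        # The first `extra` chunks take one additional row each.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def distance(a, b):
    """Placeholder for the expensive metric (WMD in the question)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distances_for_chunk(chunk, X_train):
    """Hypothetical per-chunk worker: one task handles many test rows."""
    return [[distance(row, t) for t in X_train] for row in chunk]


X_train = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
X_test = [[float(i), 1.0 - i] for i in range(10)]

bounds = split_into_chunks(len(X_test), n_jobs=3)
# Each call below is what a single worker process would receive as one task.
parts = [distances_for_chunk(X_test[a:b], X_train) for a, b in bounds]
dist = [row for part in parts for row in part]  # chunk order keeps row order
```

In the class from the question, the same idea would presumably mean passing slices such as `X_test[a:b]` to `delayed(...)` instead of single rows, with a worker method that loops over the slice.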
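However, because of the size of the variables that have to be communicated, I think that data parallelism is more appropriate than task parallelism for this type of problem. I therefore intend to use mpi4py, so that each process separately creates its own X_train, X_test and self.W_embed matrices and only the results of the computation are communicated.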
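A minimal sketch of what such a data-parallel layout might look like, under several assumptions: `build_data`, `distance` and `rows_for_rank` are hypothetical helpers, the data is toy data rebuilt identically in every process, and rows are assigned round-robin. If mpi4py is not installed, the script falls back to a single process so that it still runs.

```python
# mpi4py is optional here: without it the script runs as a single process.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    comm, rank, size = None, 0, 1


def build_data():
    """Hypothetical loader: every process rebuilds the same data locally."""
    X_train = [[float(i + j) for j in range(3)] for i in range(4)]
    X_test = [[float(i * j) for j in range(3)] for i in range(6)]
    return X_train, X_test


def distance(a, b):
    """Placeholder for the expensive metric (WMD in the question)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rows_for_rank(n_rows, rank, size):
    """Test-row indices handled by one process (round-robin assignment)."""
    return list(range(rank, n_rows, size))


X_train, X_test = build_data()
my_rows = rows_for_rank(len(X_test), rank, size)
local = {i: [distance(X_test[i], t) for t in X_train] for i in my_rows}

# Only the results travel between processes.
gathered = comm.gather(local, root=0) if comm is not None else [local]

if rank == 0:
    merged = {}
    for part in gathered:
        merged.update(part)
    dist = [merged[i] for i in range(len(X_test))]
    print(len(dist), "x", len(dist[0]), "distance matrix assembled on rank 0")
```

With mpi4py available, a script like this would be started through the MPI launcher, for example `mpiexec -n 4 python script.py` (the script name is a placeholder).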