scikit-learn: Unexplained initial computation when using a custom distance function with NearestNeighbors

Asked: 2015-05-28 12:33:38

Tags: python scikit-learn nearest-neighbor

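I have defined a custom distance function (still a metric) for nearest-neighbor search. It processes the features one by one before returning the distance. The script below gives an idea of what I am trying to do.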

import numpy as np
from sklearn.neighbors import NearestNeighbors # ver 0.16-git

def custom_dist_func(x,y):
    print x,y
    # a custom function will be here handling mixed features (real, nominal etc.)
    return np.sqrt(sum((x-y)**2)) # use just this for now

data = np.array([ [1,2,3,1], [4,5,6,2], [7,8,9,3], [1,3,3,2], [5,5,6,3], [9,8,9,1] ])
neigh = NearestNeighbors(n_neighbors = 3, algorithm='ball_tree', metric='pyfunc', func=custom_dist_func)
neigh.fit(data)

Here is what is printed when I run this script.

[ 0.60337662  0.07253084  0.27630738  0.90360858  0.50337067  0.31940312
  0.42077267  0.70218361  0.15748644  0.20227022] [ 0.60337662  0.07253084  0.27630738  0.90360858  0.50337067  0.31940312
  0.42077267  0.70218361  0.15748644  0.20227022]
[ 4.5         5.16666667  6.          2.        ] [ 1.  2.  3.  1.]
[ 4.5         5.16666667  6.          2.        ] [ 4.  5.  6.  2.]
[ 4.5         5.16666667  6.          2.        ] [ 7.  8.  9.  3.]
[ 4.5         5.16666667  6.          2.        ] [ 1.  3.  3.  2.]
[ 4.5         5.16666667  6.          2.        ] [ 5.  5.  6.  3.]
[ 4.5         5.16666667  6.          2.        ] [ 9.  8.  9.  1.]

The remaining computations are between vectors of length len_features = 4, but the very first computation is between two vectors of length 10.

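I cannot explain this initial computation. It is still there when I try with len_features > 10, and it then makes the program raise an index error, because the custom function I actually need works on each available feature separately.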

1 Answer:

Answer 0: (score: 0)

Note that this is not a complete answer.

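I deliberately planted an error in the distance function (a reference to the undefined name ff, which raises a NameError on the first call):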

def custom_dist_func(x,y):
    ff
    print x,y
    # a custom function will be here handling mixed features (real, nominal etc.)
    return np.sqrt(sum((x-y)**2)) # use just this for now

and re-ran the code (your problem, as I verified, is in the first call).

The output was:

NameError                                 Traceback (most recent call last)
<ipython-input-1-ce434c7e8153> in <module>()
     10 data = np.array([ [1,2,3,1], [4,5,6,2], [7,8,9,3], [1,3,3,2], [5,5,6,3], [9,8,9,1] ])
     11 neigh = NearestNeighbors(n_neighbors = 3, algorithm='ball_tree', metric='pyfunc', func=custom_dist_func)
---> 12 neigh.fit(data)

/home/amit/.local/lib/python2.7/site-packages/sklearn/neighbors/base.pyc in fit(self, X, y)
    779             Training data. If array or matrix, shape = [n_samples, n_features]
    780         """
--> 781         return self._fit(X)

/home/amit/.local/lib/python2.7/site-packages/sklearn/neighbors/base.pyc in _fit(self, X)
    249             self._tree = BallTree(X, self.leaf_size,
    250                                   metric=self.effective_metric_,
--> 251                                   **self.effective_metric_params_)
    252         elif self._fit_method == 'kd_tree':
    253             self._tree = KDTree(X, self.leaf_size,

/home/amit/.local/lib/python2.7/site-packages/sklearn/neighbors/ball_tree.so in sklearn.neighbors.ball_tree.BinaryTree.__init__ (sklearn/neighbors/ball_tree.c:8430)()

/home/amit/.local/lib/python2.7/site-packages/sklearn/neighbors/dist_metrics.so in sklearn.neighbors.dist_metrics.DistanceMetric.get_metric (sklearn/neighbors/dist_metrics.c:4066)()

/home/amit/.local/lib/python2.7/site-packages/sklearn/neighbors/dist_metrics.so in sklearn.neighbors.dist_metrics.PyFuncDistance.__init__ (sklearn/neighbors/dist_metrics.c:9286)()

<ipython-input-1-ce434c7e8153> in custom_dist_func(x, y)
      3 
      4 def custom_dist_func(x,y):
----> 5     ff
      6     print x,y
      7     # a custom function will be here handling mixed features (real, nominal etc.)

NameError: global name 'ff' is not defined

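So the failure evidently happens while the ball tree is being created: the traceback shows the distance function being called from PyFuncDistance.__init__, which is reached from BinaryTree.__init__.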

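In fact, running pdb at that point shows that X is your original matrix. The trouble is that from there execution goes into the compiled code of dist_metrics.pyx, which pdb cannot follow.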
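Since pdb cannot step into that code, here is a plain-Python sketch of the pattern the traceback and the question's output seem to point to: a wrapper that calls the user function once at construction time with a throwaway vector, passing the same vector as both arguments. This is an assumption based on the 10 random values printed first in the question, not a copy of scikit-learn's source. ProbingMetric, probe_len and recording_dist are hypothetical names used only for illustration.

```python
import random


class ProbingMetric(object):
    """Hypothetical stand-in for a metric wrapper; NOT scikit-learn's code."""

    def __init__(self, func, probe_len=10):
        self.func = func
        # Assumption: the wrapper probes the callable once with a random
        # vector of fixed length, using the same vector for both arguments.
        probe = [random.random() for _ in range(probe_len)]
        result = func(probe, probe)
        # Reject callables whose return value cannot be turned into a float.
        float(result)

    def dist(self, x, y):
        return self.func(x, y)


calls = []


def recording_dist(x, y):
    # Record the argument lengths instead of printing the vectors.
    calls.append((len(x), len(y)))
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


metric = ProbingMetric(recording_dist)   # triggers the probe call
metric.dist([1, 2, 3, 1], [4, 5, 6, 2])  # an ordinary 4-feature call
print(calls)  # [(10, 10), (4, 4)]
```

If the real wrapper behaves like this, the first call would have nothing to do with your data, which would match what you see.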

So this does not solve the problem, but it does narrow it down. I suggest you look at dist_metrics.pyx and dig further from there.
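In the meantime, one possible workaround (a sketch, not a verified fix) is to make the per-feature function tolerate vectors whose length is not the one it expects, so that an unexpected first call cannot raise an index error. make_guarded_distance, n_features and nominal_idx are hypothetical names, and treating nominal features as a 0/1 mismatch is only an assumed example of "mixed feature" handling.

```python
import math


def make_guarded_distance(n_features, nominal_idx=()):
    """Build a distance function for vectors with n_features entries.

    nominal_idx lists the positions treated as nominal (match / no match);
    every other position is treated as a real-valued feature.
    """
    nominal = set(nominal_idx)

    def dist(x, y):
        if len(x) != n_features or len(y) != n_features:
            # Unexpected length (for example a probe vector): fall back to
            # plain Euclidean distance instead of indexing out of range.
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        total = 0.0
        for i in range(n_features):
            if i in nominal:
                # Nominal feature: contributes 1 on mismatch, 0 on match.
                total += 0.0 if x[i] == y[i] else 1.0
            else:
                # Real-valued feature: squared difference.
                total += (x[i] - y[i]) ** 2
        return math.sqrt(total)

    return dist


custom_dist_func = make_guarded_distance(n_features=4, nominal_idx=[3])
print(custom_dist_func([1, 2, 3, 1], [4, 5, 6, 2]))  # normal 4-feature call
print(custom_dist_func([0.5] * 10, [0.5] * 10))      # probe-like call
```

Keep in mind that the output in the question also shows calls against a vector that is not a row of data ([4.5, 5.16666667, 6., 2.]), so a nominal position may receive values that are not valid categories. The function has to cope with that as well.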