Question

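I have 250,000 lists, each containing 100 strings on average, stored across 10 dictionaries. I need to compute the pairwise similarity of all the lists (the similarity measure isn't relevant here; in short, it involves taking the intersection of two lists and normalizing the result by some constant).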

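The code I came up with for the pairwise comparison is quite simple: I just use itertools.product to compare every list with every other list. The problem is performing these computations on 250,000 lists in a time-efficient way. For anyone who has dealt with a similar problem: which of the usual options (scipy, PyTables) is best suited for this, given the following criteria?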

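support for Python data types
smart storage of a very sparse matrix (about 80% of the values will be 0)
efficient (the computation can finish in under 10 hours)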

Answer 1

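Do you just want the most efficient way to determine the distance between any two points in your data?

Or do you actually need the m x m distance matrix that stores all pairwise similarity values for all rows in your data?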

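Usually it is far more efficient to keep your data in some metric space, in a data structure optimized for fast retrieval, than to pre-compute all pairwise similarity values and then look them up. Needless to say, the distance-matrix option scales terribly: n data points require an n x n distance matrix to store the pairwise similarity scores.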
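To put rough numbers on that scaling for the 250,000 lists in the question, here is a back-of-the-envelope sketch. The 8 bytes per entry is an assumption (a dense matrix of 64-bit floats), not something stated in the question.

```python
n = 250_000                        # number of lists in the question

entries = n * n                    # cells in a dense n x n matrix
unique_pairs = n * (n - 1) // 2    # distinct unordered pairs (i < j)
bytes_dense = entries * 8          # assumption: one 8-byte float per cell

print("matrix cells:  {:,}".format(entries))
print("unique pairs:  {:,}".format(unique_pairs))
print("dense storage: {:.0f} GB".format(bytes_dense / 1e9))
```

Even counting only the unique pairs, that is tens of billions of comparisons, which is why an index structure that answers neighbor queries on demand is usually the better trade.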

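A kd-tree is the technique of choice for low-dimensional data ("low" here meaning fewer than about 20 features); Voronoi tessellation is often preferred for higher-dimensional data.

More recently, the ball tree has been used as a superior alternative to both: it has the performance of the kd-tree without the degradation at high dimension.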

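scikit-learn has a good implementation, which includes unit tests. It is well documented and currently under active development.

scikit-learn is built on NumPy and SciPy, so both are dependencies. The various installation options for scikit-learn are listed on the project site.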

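The most common use case for ball trees is k-Nearest Neighbors, but a ball tree also works quite well on its own, for instance in a case like the one described in the OP.

You can use the scikit-learn Ball Tree implementation like so: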

>>> # create some fake data--a 2D NumPy array having 10,000 rows and 10 columns
>>> import numpy as np
>>> D = np.random.randn(10000 * 10).reshape(10000, 10)

>>> # import the BallTree class
>>> from sklearn.neighbors import BallTree

>>> # call the constructor, passing in the data array and a 'leaf size'
>>> # the ball tree is instantiated and populated in the single step below:

>>> BT = BallTree(D, leaf_size=5, p=2)

>>> # 'leaf size' specifies the number of points at or below which
>>> # brute-force search is triggered
>>> # 'p' is the Minkowski parameter: p=2 (the default) gives Euclidean distance;
>>> # p=1 gives Manhattan (aka 'taxicab' or 'city block') distance

>>> type(BT)
    <type 'sklearn.neighbors.ball_tree.BallTree'>

Instantiating and populating the ball tree is very fast (timed using Corey Goldberg's timer class):

>>> with Timer() as t:
...     BT = BallTree(D, leaf_size=5)

>>> "ball tree instantiated & populated in {0:.2f} milliseconds".format(t.elapsed)
    'ball tree instantiated & populated in 13.90 milliseconds'
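The Timer class itself is not shown in the answer. A minimal sketch of such a context manager might look like the following; this is a hypothetical stand-in, not Corey Goldberg's actual code, and it assumes that `elapsed` holds the elapsed wall-clock time in milliseconds, as the formatted output suggests.

```python
import time


class Timer:
    """Hypothetical stand-in for the timer class used in the answer."""

    def __enter__(self):
        self._start = time.perf_counter()
        self.elapsed = None
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Assumption: 'elapsed' is reported in milliseconds.
        self.elapsed = (time.perf_counter() - self._start) * 1000.0
        return False  # do not swallow exceptions raised in the block


# Usage: time an arbitrary piece of work.
with Timer() as t:
    total = sum(range(100_000))

print("summed in {0:.2f} milliseconds".format(t.elapsed))
```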

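Querying the ball tree is also fast.

Example query: return the three data points closest to the data point at row index 500, and for each of them return its index and its distance from the reference point D[500,:]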

>>> # ball tree has an instance method, 'query' which returns pair-wise distance
>>> # and an index; one distance and index is returned per 'pair' of data points

>>> # query expects a 2D array, so pass row 500 as a 1 x 10 slice
>>> dx, idx = BT.query(D[500:501], k=3)

>>> dx    # distance
    array([[ 0.   ,  1.206,  1.58 ]])

>>> idx    # index
    array([[500, 556, 373]], dtype=int32)

>>> with Timer() as t:
...     dx, idx = BT.query(D[500:501], k=3)

>>> "query results returned in {0:.2f} milliseconds".format(t.elapsed)
    'query results returned in 15.85 milliseconds'

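The default distance metric in the scikit-learn Ball Tree implementation is Minkowski, which is just a generalization of Euclidean and Manhattan distance (i.e., the Minkowski expression has a parameter p which, when set to 2, collapses to Euclidean, and to Manhattan for p = 1).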
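To make the role of p concrete, here is a small pure-Python sketch of the Minkowski distance. The helper name `minkowski` and the sample points are illustrative only and are not part of scikit-learn.

```python
def minkowski(u, v, p):
    """Minkowski distance: (sum_i |u_i - v_i|**p) ** (1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


u = [0.0, 0.0]
v = [3.0, 4.0]

manhattan = minkowski(u, v, p=1)   # |3| + |4|
euclidean = minkowski(u, v, p=2)   # sqrt(3**2 + 4**2)

print(manhattan, euclidean)
```

With p=1 the terms are simply summed (Manhattan); with p=2 the squares are summed and the square root taken (Euclidean).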

Answer 2

If you define an appropriate distance (similarity) function, then some of the functions in scipy.spatial.distance may help.
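As a rough sketch of what that could look like: the lists can be encoded as rows of a boolean membership matrix and passed to `scipy.spatial.distance.pdist`. Jaccard distance is used here only as a stand-in for the question's intersection-based measure, and the three toy lists are made up for illustration. Note that `pdist` computes every pair, so this suits a subset of the data rather than all 250,000 lists at once.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data (made up): each list of strings becomes one boolean row.
lists = [["a", "b", "c"], ["b", "c", "d"], ["x", "y"]]

vocab = sorted({s for lst in lists for s in lst})
col = {s: j for j, s in enumerate(vocab)}

X = np.zeros((len(lists), len(vocab)), dtype=bool)
for i, lst in enumerate(lists):
    for s in lst:
        X[i, col[s]] = True

# pdist returns the condensed upper triangle; squareform expands it
# to a full symmetric matrix with zeros on the diagonal.
dist = squareform(pdist(X, metric="jaccard"))
sim = 1.0 - dist   # Jaccard similarity = |intersection| / |union|

print(sim)
```

Here the first two lists share 2 of 4 distinct strings, and the third list shares nothing with the others.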

Most efficient way to calculate pairwise similarity of 250k lists
