Question

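Comparing lists of lists has been asked before, but the Python environment I am working in cannot fully integrate all of the methods and classes in numpy. I cannot import pandas either.

I am trying to compare the lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.

My approach works fine if the big list contains fewer than 50 lists. However, I need to compare at least 20k lists, and ideally 1 million+. I am currently looking into itertools. What is the fastest, most efficient approach for large data sets without using numpy or pandas?

I can use some numpy methods and classes, but not all of them. For example, numpy.allclose and numpy.all do not work properly because of the environment I am working in.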

    import numpy as np

    global rel_tol, avg_lists
    rel_tol = .1
    avg_lists = []
    cntr = 0  # counter was never initialized in the original snippet
    # compare the lists in the big list and output ~8-10 lists that approximate all the lists in the big list
    for j in range(len(big_list)):
        # convert the outer list once per j instead of once per (j, k) pair
        array1 = np.array(big_list[j])

        for k in range(len(big_list)):
            if j != k:
                array2 = np.array(big_list[k])
                abs_diff = np.absolute(np.subtract(array1, array2))

                # cannot use numpy.allclose
                # keep big_list[k] if its largest element-wise difference from big_list[j] is <= rel_tol
                # (note: this compares an absolute difference against rel_tol)
                if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                    cntr += 1
                    avg_lists.append(big_list[k])

Answer 1

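Fundamentally, it looks like what you are aiming for is a clustering operation (i.e. representing a set of N points by K < N cluster centers). I would suggest a K-Means clustering approach, in which you increase K until the size of your clusters falls below your desired threshold.

I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that is not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you could use that instead.

Here is a k-means approach using scikit-learn: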
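In case neither scikit-learn nor numpy is usable, the following is a minimal pure-Python sketch of the idea, not a tuned implementation. It uses plain Lloyd's iterations for K-means and then increases K until every point lies within a tolerance of its cluster center. The function names (`kmeans`, `grow_k`), the parameters (`n_iter`, `seed`, `k_max`), and the choice of "largest element-wise absolute difference" as the tolerance measure are assumptions made for illustration, the last one mirroring the check in the question.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (hypothetical helper); returns (centers, labels)."""
    rng = random.Random(seed)
    # start from k distinct input points chosen at random
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # make sure the returned labels match the returned centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


def grow_k(points, tol, k_max=20):
    """Increase k until each point is within tol of its center
    (largest element-wise absolute difference), or k_max is reached."""
    for k in range(1, min(k_max, len(points)) + 1):
        centers, labels = kmeans(points, k)
        worst = max(
            max(abs(x - y) for x, y in zip(p, centers[lab]))
            for p, lab in zip(points, labels)
        )
        if worst <= tol:
            break
    return centers, labels, worst


# toy data: three tight, well-separated groups
points = [[0.0, 0.0], [0.05, 0.0], [5.0, 5.0], [5.05, 5.0], [10.0, 0.0], [10.0, 0.05]]
centers, labels, worst = grow_k(points, tol=0.1)
print(len(centers), worst)
```

Each pass costs on the order of N times K distance computations, rather than the N squared pairwise comparisons in the question's double loop. K-means can settle in a poor local optimum depending on the starting centers, so the K found this way may be larger than strictly necessary.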

    # 100 lists of length 10 = 100 points in 10 dimensions
    from random import random
    big_list = [[random() for i in range(10)] for j in range(100)]

    # compute eight representative points
    from sklearn.cluster import KMeans
    model = KMeans(n_clusters=8)
    model.fit(big_list)
    centers = model.cluster_centers_
    print(centers.shape)  # (8, 10)

    # this is the sum of square distances of your points to the cluster centers
    # you can adjust n_clusters until this is small enough for your purposes.
    sum_sq_dists = model.inertia_

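From here you could, for example, find the point in each cluster that is closest to its center and treat that point as the average. Without more detail about the problem you are trying to solve, it is hard to say for sure. But a clustering approach like this is likely to be the most efficient way to solve a problem like the one you describe in your question.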
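Picking the closest actual point per cluster could look roughly like the sketch below. It is written in plain Python so that it does not depend on numpy. The helper name `representatives` is hypothetical, and the toy `points`, `centers`, and `labels` are made up for illustration. With scikit-learn, `model.cluster_centers_` and `model.labels_` would play the roles of `centers` and `labels`.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def representatives(points, centers, labels):
    """For each cluster, return the member point closest to its center
    (None if the cluster has no members)."""
    reps = []
    for c, center in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if members:
            reps.append(min(members, key=lambda p: sq_dist(p, center)))
        else:
            reps.append(None)
    return reps


# toy example: two clusters of two points each
points = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 1.4]]
centers = [[0.05, 0.0], [1.0, 1.1]]
labels = [0, 0, 1, 1]
reps = representatives(points, centers, labels)
print(reps)
```

Returning real input lists instead of the computed centers keeps the output in the same form as the `avg_lists` in the question.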

Fast comparison of a large number of lists
