Question

I have about a million rows and 200 columns. Two sample rows are shown below.

a=[0,'aaa', 'bbb', 'ccc',.........200]
b=[1,'aaa', 'ere', 'ccc',.........200]

I want to find the intersection of two rows. From what I have read, intersection with sets is generally very fast and cost-efficient.

But when I convert the rows (lists) above into sets, the order of their elements is lost.

For example:

set(a) becomes {'aaa', 0, 'ccc', 'bbb',.........200}

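Similarly, set(b) loses its order. For my purposes I need to get the first element of each row being compared, which is the ID column, but because the set is unordered I have real trouble retrieving it.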

Is there any object that performs as well as a set for intersection and also lets me retrieve the first element?

When a and b are intersected, a[0] and b[0] should still give me 0 and 1 respectively.

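Here is what I am trying to achieve. I have many rows and columns, and I want to build a similarity matrix from the dataset. The dataset (actually a numpy array) is shown below: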

ID   AGE  Occupation Gender Product_range   Product
0   25-34   IT          M   40-50            laptop
1   18-24   Student     F   30-40            desktop
2   25-34   IT          M   40-50            laptop
3   35-44   Research    M   60-70            TV
4   35-44   Research    M   0-1              AC
5   25-34   Lawyer      F   5-6              utensils
6   45-54   Business    F   4-5              toaster

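I want to create a similarity matrix from it (7 × 7 in our example), where each matrix element is the similarity between two rows. As you can see, rows 0 and 2 are identical except for the row number. The row number takes part in the intersection but never contributes to the result.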

The code I wrote to compute the similarity is shown below:

import itertools

# The wrapper function name is illustrative; the original snippet only showed its body.
def similarity_upper_triangle(data_train):
    data_set = [set(row) for row in data_train]
    flattened_upper_triangle_of_the_matrix = []

    columns = 5  # ID doesn't participate
    for row1, row2 in itertools.combinations(data_set, r=2):
        # Here I want to capture the row numbers, because I want to store
        # the row numbers of the two rows that are most similar.
        intersection = row1.intersection(row2)
        flattened_upper_triangle_of_the_matrix.append(len(intersection) / columns)

    return flattened_upper_triangle_of_the_matrix

Answer 1

You can try the ordered-set package: https://pypi.python.org/pypi/ordered-set
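To illustrate the idea of an order-preserving intersection without installing anything, here is a minimal sketch using only the standard library. It is a stand-in for an ordered set and does not use the ordered-set package's API. The helper name `ordered_intersection` is hypothetical, and the sample rows are shortened versions of `a` and `b` from the question.

```python
def ordered_intersection(first, second):
    """Return the items of `first` that also occur in `second`, in `first`'s order.

    Hypothetical helper for illustration; not part of any library.
    """
    lookup = set(second)  # set gives fast membership tests
    # dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
    return [item for item in dict.fromkeys(first) if item in lookup]


# Shortened versions of the sample rows from the question
a = [0, 'aaa', 'bbb', 'ccc']
b = [1, 'aaa', 'ere', 'ccc']

common = ordered_intersection(a, b)
print(common)        # ['aaa', 'ccc']
print(a[0], b[0])    # the original lists are untouched: 0 1
```

Because the original lists are never converted in place, `a[0]` and `b[0]` remain available as the row IDs.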

Your question may be a duplicate of: Does Python have an ordered set?

The performance of an ordered set is comparable to that of a built-in Python set.
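For the similarity computation in the question, another option that needs no ordered container at all is to keep the ID next to the set rather than inside it. The sketch below is not from the original answer. It assumes the ID is always the first column, uses the first four sample rows from the question, and the function name `pairwise_similarity` is hypothetical.

```python
import itertools

# First four sample rows from the question:
# (ID, AGE, Occupation, Gender, Product_range, Product)
data_train = [
    [0, '25-34', 'IT', 'M', '40-50', 'laptop'],
    [1, '18-24', 'Student', 'F', '30-40', 'desktop'],
    [2, '25-34', 'IT', 'M', '40-50', 'laptop'],
    [3, '35-44', 'Research', 'M', '60-70', 'TV'],
]


def pairwise_similarity(rows):
    """Return (id1, id2, similarity) for every pair of rows.

    Hypothetical helper; assumes row[0] is the ID column.
    """
    # Store the ID beside the set so it is never lost and never
    # takes part in the intersection.
    keyed = [(row[0], set(row[1:])) for row in rows]
    columns = len(rows[0]) - 1  # the ID column does not participate
    return [
        (id1, id2, len(s1 & s2) / columns)
        for (id1, s1), (id2, s2) in itertools.combinations(keyed, 2)
    ]


results = pairwise_similarity(data_train)
best = max(results, key=lambda item: item[2])
print(best)  # (0, 2, 1.0)
```

Each result carries both row IDs, so the most similar pair can be looked up directly. Note that a set comparison ignores column position, so equal values appearing in different columns would also count as a match.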
