Question

I have a list of tuples containing various combinations of items:

example = [(1,2), (2,1), (1,1), (1,1), (2,1), (2,3,1), (1,2,3)]

I want to group and count them by unique combination, producing the result:

result = [((1,2), 3), ((1,1), 2), ((2,3,1), 2)]

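Keeping the order, or preserving which permutation of a combination survives, is not important. What is very important is that the operation can be done with a lambda function and that the output is still a list of tuples as shown above, because I will be working with Spark RDD objects.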

My code currently counts the patterns taken from a dataset using:

from operator import add

RDD = sc.parallelize(example)
result = RDD.map(lambda y: (y, 1))\
            .reduceByKey(add)\
            .collect()
print(result)

I need another .map command that accounts for the different permutations, as explained above.

Answer 1

You can use an OrderedDict keyed by the sorted form of each item:

>>> from collections import OrderedDict
>>> d=OrderedDict()
>>> for i in example:
...   d.setdefault(tuple(sorted(i)),i)
... 
('a', 'b')
('a', 'a', 'a')
('a', 'a')
('a', 'b')
('c', 'd')
('b', 'c', 'a')
('b', 'c', 'a')
>>> d
OrderedDict([(('a', 'b'), ('a', 'b')), (('a', 'a', 'a'), ('a', 'a', 'a')), (('a', 'a'), ('a', 'a')), (('c', 'd'), ('c', 'd')), (('a', 'b', 'c'), ('b', 'c', 'a'))])
>>> d.values()
[('a', 'b'), ('a', 'a', 'a'), ('a', 'a'), ('c', 'd'), ('b', 'c', 'a')]

Answer 2

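How about this: maintain a set containing the sorted form of every item you have already seen. Only add an item to the result list if you have not yet seen its sorted form.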

example = [ ('a','b'), ('a','a','a'), ('a','a'), ('b','a'), ('c', 'd'), ('b','c','a'), ('a','b','c') ]
result = []
seen = set()
for item in example:
    sorted_form = tuple(sorted(item))
    if sorted_form not in seen:
        result.append(item)
        seen.add(sorted_form)
print(result)

Result:

[('a', 'b'), ('a', 'a', 'a'), ('a', 'a'), ('c', 'd'), ('b', 'c', 'a')]
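The same sorted-form idea can be extended from deduplication to the counting the question asks for. The following is a minimal local sketch, assuming plain Python 3 rather than Spark and using `collections.Counter`, which is not part of the answer above. Each combination is reported in its sorted form, which the question says is acceptable.

```python
from collections import Counter

example = [(1, 2), (2, 1), (1, 1), (1, 1), (2, 1), (2, 3, 1), (1, 2, 3)]

# Key each tuple by its sorted form so that permutations share one counter slot.
counts = Counter(tuple(sorted(item)) for item in example)

# Convert to the list-of-(combination, count) shape used in the question.
result = list(counts.items())
print(result)
```

The resulting list contains the pairs ((1, 2), 3), ((1, 1), 2) and ((1, 2, 3), 2).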

Answer 3

Since you are looking for a lambda function, try the following:

lambda x, y=OrderedDict(): [a for a in x if y.setdefault(tuple(sorted(a)), a) and False] or y.values()

You can use this lambda function like so:

uniquify = lambda x, y=OrderedDict(): [a for a in x if y.setdefault(tuple(sorted(a)), a) and False] or y.values()
result = uniquify(example)

Obviously, this sacrifices readability compared with the other answers. It is basically doing the same thing as Kasramvd's answer, in a single ugly line.
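One caveat with the one-liner: the default argument `y=OrderedDict()` is created once, when the lambda is defined, so repeated calls share the same dictionary and earlier results leak into later ones. A hedged sketch of an equivalent that builds a fresh dictionary per call is shown below; it is written as a named function (reusing the name `uniquify` for illustration), not as a lambda.

```python
from collections import OrderedDict


def uniquify(items):
    # A fresh dict per call, so nothing is carried over between calls.
    seen = OrderedDict()
    for item in items:
        # setdefault stores the item only if its sorted form is new,
        # so the first permutation encountered is the one that is kept.
        seen.setdefault(tuple(sorted(item)), item)
    return list(seen.values())


first = uniquify([('a', 'b'), ('b', 'a'), ('c', 'd')])
second = uniquify([('x', 'y'), ('y', 'x')])
print(first)
print(second)
```

Here `second` contains only the `('x', 'y')` entry, whereas calling the lambda version twice would also return the entries from the first call.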

Answer 4

This is similar to the sorted dict approach.

from itertools import groupby
ex = [(1,2,3), (3,2,1), (1,1), (2,1), (1,2), (3,2), (2,3,1)]
f = lambda x: tuple(sorted(x))  # used as the key
[tuple(k) for k, _ in groupby(sorted(ex, key=f), key=f)]

The nice thing is that you can see which tuples belong to the same combination:

In [16]: example = [ ('a','b'), ('a','a','a'), ('a','a'), ('a', 'a', 'a', 'a'), ('b','a'), ('c', 'd'), ('b','c','a'), ('a','b','c') ]
In [17]: for k, grpr in groupby(sorted(example, key=lambda x: tuple(sorted(x))), key=lambda x: tuple(sorted(x))):
   ....:     print k, list(grpr)
   ....:     
('a', 'a') [('a', 'a')]
('a', 'a', 'a') [('a', 'a', 'a')]
('a', 'a', 'a', 'a') [('a', 'a', 'a', 'a')]
('a', 'b') [('a', 'b'), ('b', 'a')]
('a', 'b', 'c') [('b', 'c', 'a'), ('a', 'b', 'c')]
('c', 'd') [('c', 'd')]
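Since each group holds every permutation of one combination, taking the length of the group gives the count the question asks for. A small sketch of that, assuming Python 3 and the numeric example from the question (the helper name `key` is only illustrative):

```python
from itertools import groupby

example = [(1, 2), (2, 1), (1, 1), (1, 1), (2, 1), (2, 3, 1), (1, 2, 3)]


def key(t):
    # Permutations of the same combination share one sorted form.
    return tuple(sorted(t))


# groupby only merges adjacent items, so sort by the same key first.
result = [(k, len(list(group)))
          for k, group in groupby(sorted(example, key=key), key=key)]
print(result)  # [((1, 1), 2), ((1, 2), 3), ((1, 2, 3), 2)]
```

This needs a full sort of the data, so it suits local lists rather than a distributed RDD.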

Answer 5

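Based on the comments, what you actually need is map-reduce. I don't have Spark installed, but according to the docs (see transformations), it should be something like this: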

data.map(lambda i: (frozenset(i), i)).reduceByKey(lambda _, i : i)

If your dataset contains (a, b), (b, a) in that order, this will return (b, a).
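The behaviour of the two lambdas can be checked without Spark. The sketch below uses `functools.reduce` as a local stand-in for what `reduceByKey` does with the values of a single key; that stand-in is an assumption for illustration. On a real RDD the order in which values are combined may depend on partitioning, so which permutation survives is best treated as unspecified. It also shows that a frozenset key ignores repeated elements, unlike a sorted-tuple key.

```python
from functools import reduce

# Two permutations of the same combination map to the same frozenset key.
values_for_key = [('a', 'b'), ('b', 'a')]
same_key = frozenset(values_for_key[0]) == frozenset(values_for_key[1])

# Folding left to right with `lambda _, i: i` keeps the last value seen.
kept = reduce(lambda _, i: i, values_for_key)

# frozenset drops repeated elements, so these two tuples share a key ...
collapsed = frozenset((1, 1)) == frozenset((1, 1, 1))
# ... whereas a sorted tuple keeps them distinct.
distinct = tuple(sorted((1, 1))) != tuple(sorted((1, 1, 1)))

print(same_key, kept, collapsed, distinct)
```

If tuples such as (1, 1) and (1, 1, 1) must be counted separately, `tuple(sorted(i))` is the safer key.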

Answer 6

I solved my own problem, although it was hard to work out exactly what I needed to use:

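from operator import add  # needed by reduceByKey(add)

example = [(1,2), (1,1,1), (1,1), (1,1), (2,1), (3,4), (2,3,1), (1,2,3)]
RDD = sc.parallelize(example)  # sc is an existing SparkContext
# Collapse each tuple to its distinct elements, drop single-element results,
# then count how often each remaining combination occurs.
result = RDD.map(lambda x: list(set(x)))\
            .filter(lambda x: len(x)>1)\
            .map(lambda x: (tuple(x), 1))\
            .reduceByKey(add)\
            .collect()
print(result)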

This also eliminates simple repeated values such as (1,1) and (1,1,1), which is fine for me.
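For readers who want the exact output shown in the question, with (1,1) kept and counted, a variant keyed on the sorted tuple is sketched below. It assumes `rdd` is a PySpark RDD of tuples; the function name `count_combinations` is hypothetical, and the sketch has not been run against a real cluster. Only `map`, `reduceByKey` and `collect` are used, as in the code above.

```python
from operator import add


def count_combinations(rdd):
    """Count tuples per combination, ignoring element order.

    `rdd` is assumed to be a PySpark RDD whose elements are tuples.
    """
    return (rdd.map(lambda t: (tuple(sorted(t)), 1))  # permutations share a key
               .reduceByKey(add)                      # sum the 1s per key
               .collect())                            # list of (combination, count)
```

With an existing SparkContext `sc`, it would be called as `count_combinations(sc.parallelize(example))`. Because the key is a sorted tuple rather than a set, repeated elements are preserved; add a `.filter(...)` step if such tuples should be dropped.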

Unique combinations in a list of k,v tuples in Python

6 Answers: