Group items in a list of tuples by a common ID

Time: 2018-01-25 15:40:28

Tags: python

I have a large dataset of synonyms (10,000+) as a list of tuples, like this:

data = [
    (435347,'cat'),
    (435347,'feline'),
    (435347,'lion'),
    (6765756,'dog'),
    (6765756,'hound'),
    (6765756,'puppy'),
    (435347,'kitten'),
    (987977,'frog')
]

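Each group of synonyms is identified by an arbitrary shared ID, in this example 435347, 6765756, and 987977.

I would like to write a function that makes the data look like this: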

processed_data = [
    (435347,'cat','feline','lion','kitten'),
    (6765756,'dog','hound','puppy'),
    (987977,'frog')
]

Any suggestions would be much appreciated!

7 answers:

Answer 0 (score: 2)

Try this:

groups = {}

for x, y in data:
    group = groups.get(x, [])
    group.append(y)
    groups[x] = group

print(groups)

Output:

{987977: ['frog'], 435347: ['cat', 'feline', 'lion', 'kitten'], 6765756: ['dog', 'hound', 'puppy']}
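
The dictionary above still needs one more step to reach the list of tuples asked for in the question. A minimal sketch, assuming Python 3.7+ (where dictionaries keep insertion order); the function name group_by_id is made up here for illustration and is not part of the original answer:

```python
def group_by_id(pairs):
    """Group (id, word) pairs into one (id, word, word, ...) tuple per ID."""
    groups = {}
    for key, word in pairs:
        # Same grouping loop as in the answer: fetch the list, extend it, store it back.
        group = groups.get(key, [])
        group.append(word)
        groups[key] = group
    # Prepend each ID to its words to form one flat tuple per group.
    return [(key,) + tuple(words) for key, words in groups.items()]


data = [
    (435347, 'cat'),
    (435347, 'feline'),
    (435347, 'lion'),
    (6765756, 'dog'),
    (6765756, 'hound'),
    (6765756, 'puppy'),
    (435347, 'kitten'),
    (987977, 'frog'),
]

processed_data = group_by_id(data)
print(processed_data)
```

On older Python versions the groups may come out in a different order, since dictionary ordering was not guaranteed there.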

Answer 1 (score: 1)

dictionary = {}
for val in data:
    id_, name = val
    if id_ in dictionary:
        dictionary[id_].append(name)
    else:
        dictionary[id_] = [id_, name]
print(list(dictionary.values()))

Output:

[[435347, 'cat', 'feline', 'lion', 'kitten'], [6765756, 'dog', 'hound', 'puppy'], [987977, 'frog']]

Answer 2 (score: 1)

You can try this:

data = [(435347,'cat'),(435347,'feline'),(435347,'lion'),(6765756,'dog'),(6765756,'hound'),(6765756,'puppy'),(435347,'kitten'),(987977,'frog')]

dataset = set(i[0] for i in data)
processed_data = sorted([(tuple([i]) + tuple(j[1] for j in data if j[0]==i)) for i in dataset])
print(processed_data)

Output:

[(435347, 'cat', 'feline', 'lion', 'kitten'), (987977, 'frog'), (6765756, 'dog', 'hound', 'puppy')]

Answer 3 (score: 0)

Here is another approach, a modification of my answer to another question. You can use reduce and map to achieve this:

def reducer(x, y):
    if isinstance(x, dict):
        ykey, yval = y
        if ykey not in x:
            x[ykey] = [yval]
        else:
            x[ykey] += [yval]
        return x
    else:
        xkey, xval = x
        ykey, yval = y
        a = {xkey: [xval]}
        if ykey in a:
            a[ykey] += [yval]
        else:
            a[ykey] = [yval]
        return a

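# functools.reduce exists on Python 2.6+ and is required on Python 3;
# list() materializes the map object on Python 3.
# The order of the groups follows dict ordering, which varies by Python version.
from functools import reduce

processed_data = list(map(lambda x: (x[0],) + tuple(x[1]), reduce(reducer, data).items()))

Output:

>>> print(processed_data)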
[(987977, 'frog'),
 (435347, 'cat', 'feline', 'lion', 'kitten'),
 (6765756, 'dog', 'hound', 'puppy')]

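Explanation

Breaking it down step by step:

The reducer() function groups the items by key into a dictionary. Each value in the dictionary is a list to which the synonyms for that key are appended.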

>>> print(reduce(reducer, data))
{435347: ['cat', 'feline', 'lion', 'kitten'],
 987977: ['frog'],
 6765756: ['dog', 'hound', 'puppy']}

We call .items() on the output of reduce() to get it as a list of tuples:

>>> print(reduce(reducer, data).items())
[(987977, ['frog']),
 (435347, ['cat', 'feline', 'lion', 'kitten']),
 (6765756, ['dog', 'hound', 'puppy'])]

Finally, we call map() to convert this output into the format you want.
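
The isinstance branch in reducer() is needed only because reduce(), when called without a starting value, passes the first element of data as the initial accumulator. A sketch of a shorter variant, assuming Python 3 and the same data list; add_pair is a name introduced here for illustration. Passing an empty dict as the third argument of reduce() means the accumulator is always a dict:

```python
from functools import reduce


def add_pair(groups, pair):
    """Fold one (id, word) pair into the accumulator dict and return it."""
    key, word = pair
    groups.setdefault(key, []).append(word)
    return groups


data = [
    (435347, 'cat'),
    (435347, 'feline'),
    (435347, 'lion'),
    (6765756, 'dog'),
    (6765756, 'hound'),
    (6765756, 'puppy'),
    (435347, 'kitten'),
    (987977, 'frog'),
]

# The third argument is the initial accumulator, so add_pair always receives a dict.
grouped = reduce(add_pair, data, {})

# Prepend each ID to its list of words to get the flat tuples.
processed_data = [(key,) + tuple(words) for key, words in grouped.items()]
print(processed_data)
```

With the initializer in place, an empty or single-element data list should also work, because reduce() then never returns a bare tuple.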

Answer 4 (score: 0)

A dictionary may be a better fit for your problem:

data = [(435347,'cat'),(435347,'feline'),(435347,'lion'),(6765756,'dog'),(6765756,'hound'),(6765756,'puppy'),(435347,'kitten'),(987977,'frog')]
results = {}
for key, item in data:
    results.setdefault(key,[]).append(item)

Output:

{435347: ['cat', 'feline', 'lion', 'kitten'],
 987977: ['frog'],
 6765756: ['dog', 'hound', 'puppy']}

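setdefault is an ideal candidate for your case. If the key does not exist, it creates a dictionary entry holding the default (here an empty list) and returns it; if the key already exists, it returns the existing entry. Either way, append() then adds the item to that list.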
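
A small sketch of the two cases, written for illustration only (the variable names first and second are not from the answer):

```python
results = {}

# Key is missing: setdefault stores the empty list under the key and returns it.
first = results.setdefault(435347, [])
first.append('cat')

# Key is present: setdefault ignores the new default and returns the stored list.
second = results.setdefault(435347, [])
second.append('feline')

print(first is second)  # both names refer to the same list object
print(results)
```

Because both calls return the same list, the one-line form results.setdefault(key, []).append(item) works whether or not the key has been seen before.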

Answer 5 (score: 0)

There are many ways to do this; here are some of them.

The data is:

data = [
    (435347,'cat'),
    (435347,'feline'),
    (435347,'lion'),
    (6765756,'dog'),
    (6765756,'hound'),
    (6765756,'puppy'),
    (435347,'kitten'),
    (987977,'frog')
]

Itertools groupby:

from itertools import groupby

print([tuple(i) for j,i in groupby(sorted(data),key=lambda x:x[0])])

collections defaultdict:

from collections import defaultdict

d=defaultdict(list)
for i in data:
    d[i[0]].append(i)

print(d)

Without any modules:

without_module={}
for i in data:
    if i[0] not in without_module:
        without_module[i[0]]=[i]
    else:
        without_module[i[0]].append(i)
print(without_module)
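
The three snippets above keep each whole (ID, word) pair inside its group, so the result is not yet in the flat format from the question. A hedged sketch of how the groupby and defaultdict variants could be adapted, assuming Python 3.7+; the names by_id, with_groupby, and with_defaultdict are introduced here for illustration:

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

data = [
    (435347, 'cat'),
    (435347, 'feline'),
    (435347, 'lion'),
    (6765756, 'dog'),
    (6765756, 'hound'),
    (6765756, 'puppy'),
    (435347, 'kitten'),
    (987977, 'frog'),
]

# groupby only merges adjacent items, so sort by ID first. Sorting on the ID
# alone is stable, which keeps the words of each group in their original order.
by_id = sorted(data, key=itemgetter(0))
with_groupby = [
    (key,) + tuple(word for _, word in group)
    for key, group in groupby(by_id, key=itemgetter(0))
]

# defaultdict variant: store only the word, then prepend the ID afterwards.
d = defaultdict(list)
for key, word in data:
    d[key].append(word)
with_defaultdict = [(key,) + tuple(words) for key, words in d.items()]

print(with_groupby)
print(with_defaultdict)
```

The groupby version returns the groups sorted by ID, while the defaultdict version returns them in the order each ID first appears in data.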

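Answer 6 (score: -1)

OK, this is only a suggestion, so please don't be upset if it's wrong.

Try taking an input and writing a for statement that reads the data from a .txt file or whatever source you like. Then put an if statement under the for.

Code: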

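animal = input("Animal: ")
# animal.txt is assumed to hold one record per line
with open("animal.txt") as f:
    for line in f:
        # print every line that contains the requested animal
        if animal in line.strip():
            print(line)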

Personally, I would suggest putting all of the data into an array and separating the entries with \n.