在列中查找重复项,返回唯一项并从python中的另一列列出其对应的值

时间:2015-01-23 22:49:43

标签: python no-duplicates

我想从第1列中删除重复项,并在colum 2中使用python返回与每个唯一项关联的相关值列表。

输入

1 2
Jack London 'Son of the Wolf'
Jack London 'Chris Farrington'
Jack London 'The God of His Fathers'
Jack London 'Children of the Frost'
William Shakespeare  'Venus and Adonis' 
William Shakespeare 'The Rape of Lucrece'
Oscar Wilde 'Ravenna'
Oscar Wilde 'Poems'

而输出应为

1 2
Jack London 'Son of the Wolf, Chris Farrington, Able Seaman, The God of His Fathers,Children of the Frost'
William Shakespeare 'The Rape of Lucrece,Venus and Adonis' 
Oscar Wilde 'Ravenna,Poems'

其中第二列包含与每个项目关联的值的总和。 我在字典

上尝试了set()函数
dic={'Jack London': 'Son of the Wolf', 'Jack London': 'Chris Farrington', 'Jack London': 'The God of His Fathers'}
set(dic)

但它只返回字典的第一个键

set(['Jack London'])

2 个答案:

答案 0 :(得分:2)

在Python中,字典每个键只能包含一个值。但该值可以是项目的集合:

>>> d = {'Jack London': ['Son of the Wolf', 'Chris Farrington']}
>>> d['Jack London']
['Son of the Wolf', 'Chris Farrington']

要从一系列键值对构造这样的字典,您可以执行以下操作:

dct = {}
for author, title in items:
    if author not in dct:
        # Create a new entry for the author
        dct[author] = [title]
    else:
        # Add another item to the existing entry
        dct[author].append(title)

循环体可以更加简洁:

dct = {}
for author, title in items:
    dct.setdefault(author, []).append(title)

答案 1 :(得分:2)

您应该使用itertools.groupby,因为您的列表已排序。

rows = [('1', '2'),
        ('Jack London', 'Son of the Wolf'),
        ('Jack London', 'Chris Farrington'),
        ('Jack London', 'The God of His Fathers'),
        ('Jack London', 'Children of the Frost'),
        ('William Shakespeare', 'Venus and Adonis'),
        ('William Shakespeare', 'The Rape of Lucrece'),
        ('Oscar Wilde', 'Ravenna'),
        ('Oscar Wilde', 'Poems')]
# I'm not sure how you get here, but that's where you get

from itertools import groupby
from operator import itemgetter

grouped = groupby(rows, itemgetter(0))
result = {group:', '.join([value[1] for value in values]) for group, values in grouped}

这会给你一个结果:

In [1]: pprint(result)
{'1': '2',
 'Jack London': 'Son of the Wolf, Chris Farrington, The God of His Fathers, '
                'Children of the Frost',
 'Oscar Wilde': 'Ravenna, Poems',
 'William Shakespeare': 'Venus and Adonis, The Rape of Lucrece'}