Question

I have the following list of tuples.

[('0', 'Hadoop'), ('0', 'Big Data'), ('0', 'HBas'), ('0', 'Java'), ('0', 'Spark'), ('0', 'Storm'), ('0', 'Cassandra'), ('1', 'NoSQL'), ('1', 'MongoDB'), ('1', 'Cassandra'), ('1', 'HBase'), ('1', 'Postgres'), ('2', 'Python'), ('2', 'skikit-learn'), ('2', 'scipy'), ('2', 'numpy'), ('2', 'statsmodels'), ('2', 'pandas'), ('3', 'R'), ('3', 'Python'), ('3', 'statistics'), ('3', 'regression'), ('3', 'probability'), ('4', 'machine learning'), ('4', 'regression'), ('4', 'decision trees'), ('4', 'libsvm'), ('5', 'Python'), ('5', 'R'), ('5', 'Java'), ('5', 'C++'), ('5', 'Haskell'), ('5', 'programming languages'), ('6', 'statistics'), ('6', 'probability'), ('6', 'mathematics'), ('6', 'theory'), ('7', 'machine learning'), ('7', 'scikit-learn'), ('7', 'Mahout'), ('7', 'neural networks'), ('8', 'neural networks'), ('8', 'deep learning'), ('8', 'Big Data'), ('8', 'artificial intelligence'), ('9', 'Hadoop'), ('9', 'Java'), ('9', 'MapReduce'), ('9', 'Big Data')]

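The value on the left is the employee ID number and the value on the right is an interest. I need to turn these into dictionaries in two different ways: one with the employee ID as the key and the interests as the values, and another with the interest as the key and the employee IDs as the values. As a simple example, one element of the final result should look like this: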

{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'],
 '1' ... etc]}

And the next one would look like this:

{'Hadoop': [0,9]...}

I tried defaultdict, but I can't seem to get it to work. Any suggestions?

Answer 1

You can use collections.defaultdict.

For example:

from collections import defaultdict

lst = [('0', 'Hadoop'),
('0', 'Big Data'),
('0', 'HBas'),
('0', 'Java'),.....]

result = defaultdict(list)
for idVal, interest in lst:
    result[idVal].append(interest)
print(result)

result = defaultdict(list)
for idVal, interest in lst:
    result[interest].append(idVal)
print(result)

Output:

defaultdict(<type 'list'>, {'1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'], '0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '3': ['R', 'Python', 'statistics', 'regression', 'probability'], '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'], '5': ['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'], '4': ['machine learning', 'regression', 'decision trees', 'libsvm'], '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks'], '6': ['statistics', 'probability', 'mathematics', 'theory'], '9': ['Hadoop', 'Java', 'MapReduce', 'Big Data'], '8': ['neural networks', 'deep learning', 'Big Data', 'artificial intelligence']})
defaultdict(<type 'list'>, {'Java': ['0', '5', '9'], 'neural networks': ['7', '8'], 'NoSQL': ['1'], 'Hadoop': ['0', '9'], 'Mahout': ['7'], 'Storm': ['0'], 'regression': ['3', '4'], 'statistics': ['3', '6'], 'probability': ['3', '6'], 'programming languages': ['5'], 'Python': ['2', '3', '5'], 'deep learning': ['8'], 'Haskell': ['5'], 'mathematics': ['6'], 'HBas': ['0'], 'numpy': ['2'], 'pandas': ['2'], 'artificial intelligence': ['8'], 'theory': ['6'], 'libsvm': ['4'], 'C++': ['5'], 'R': ['3', '5'], 'HBase': ['1'], 'Spark': ['0'], 'Postgres': ['1'], 'decision trees': ['4'], 'Big Data': ['0', '8', '9'], 'MongoDB': ['1'], 'scikit-learn': ['7'], 'MapReduce': ['9'], 'machine learning': ['4', '7'], 'scipy': ['2'], 'skikit-learn': ['2'], 'statsmodels': ['2'], 'Cassandra': ['0', '1']})
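The output above is the Python 2 repr; under Python 3 the wrapper prints as defaultdict(<class 'list'>, {...}). If a plain dict is wanted, for example so that looking up a missing key raises KeyError instead of silently creating an empty list, the defaultdict can be passed to dict(). A minimal sketch, using a shortened sample of the data (the names sample, by_id and plain are only for illustration):

```python
from collections import defaultdict

# Shortened sample of the question's data, for illustration only.
sample = [('0', 'Hadoop'), ('0', 'Big Data'), ('1', 'NoSQL'), ('9', 'Hadoop')]

by_id = defaultdict(list)
for id_val, interest in sample:
    by_id[id_val].append(interest)

# dict() makes a shallow copy of the mapping as a plain dictionary.
plain = dict(by_id)
print(plain)
```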

Answer 2

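collections.defaultdict is indeed the right way to solve this. Create one for each dictionary you need, then iterate over the list once and add each pair to both dictionaries.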

import collections

ids = collections.defaultdict(list)
interests = collections.defaultdict(list)

# data is the list of (employee ID, interest) tuples from the question
for ident, interest in data:
    ids[ident].append(interest)
    interests[interest].append(ident)
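For a version that can be pasted and run as-is, here is a sketch of the same single-pass loop with a shortened sample standing in for the full list (the contents of data here are an illustrative subset, not the full input):

```python
import collections

# Illustrative subset of the question's list of (employee ID, interest) tuples.
data = [('0', 'Hadoop'), ('0', 'Java'), ('5', 'Java'), ('9', 'Hadoop'), ('9', 'Java')]

ids = collections.defaultdict(list)        # employee ID -> interests
interests = collections.defaultdict(list)  # interest -> employee IDs

# One pass over the data fills both mappings.
for ident, interest in data:
    ids[ident].append(interest)
    interests[interest].append(ident)

print(dict(ids))
print(dict(interests))
```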

Answer 3

How about pandas?

data = [('0', 'Hadoop'),
('0', 'Big Data'),
('0', 'HBas'),...]

import pandas as pd
df = pd.DataFrame(data)
df_1 = df.groupby(0)[1].apply(list)
df_2 = df.groupby(1)[0].apply(list)

print( df_1.to_dict() )
print( df_2.to_dict() )

Result:

{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', '...
{'Big Data': ['0', '8', '9'], 'C++' ...

Answer 4

The most Pythonic and shortest code I could come up with, without any imports:

alist = [('0', 'Hadoop'),
('0', 'Big Data'),
('0', 'HBas'),
('0', 'Java'),
('0', 'Spark'),
('0', 'Storm'),...]

adict = {}
bdict = {}
for key, value in alist:
    adict[key] = adict.get(key, []) + [value]
    bdict[value] = bdict.get(value, []) + [key]

Output:

print(adict)
#{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'],...}

print(bdict)
#{'Hadoop': ['0', '9'], 'Big Data': ['0', '8', '9'], 'HBas': ['0'], 'Java': ['0', '5', '9'], 'Spark': ['0'], 'Storm': ['0'],...}
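The question's second example shows the IDs as integers ([0,9]), whereas the output above keeps them as strings. If integers are wanted, one option is to convert with int() while building the reverse mapping. A minimal sketch in the same import-free style, assuming every ID string is a valid integer; the sample list and the name interest_to_ids are illustrative:

```python
# Illustrative subset of the question's data.
sample = [('0', 'Hadoop'), ('0', 'Java'), ('5', 'Java'), ('9', 'Hadoop'), ('9', 'Java')]

interest_to_ids = {}
for emp_id, interest in sample:
    # int() raises ValueError if an ID is not numeric.
    interest_to_ids[interest] = interest_to_ids.get(interest, []) + [int(emp_id)]

print(interest_to_ids)
```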

Answer 5

defaultdict is the faster option, but you can also group in a single pass with setdefault():

d1 = {}
d2 = {}
# l is the list of tuples from the question
for fst, snd in l:
    d1.setdefault(fst, []).append(snd)
    d2.setdefault(snd, []).append(fst)

print(d1)
print(d2)

Which outputs:

{'0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'],
 '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'],
 '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'],
 '3': ['R', 'Python', 'statistics', 'regression', 'probability'],
 '4': ['machine learning', 'regression', 'decision trees', 'libsvm'],
 '5': ['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'],
 '6': ['statistics', 'probability', 'mathematics', 'theory'],
 '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks'],
 '8': ['neural networks',
       'deep learning',
       'Big Data',
       'artificial intelligence'],
 '9': ['Hadoop', 'Java', 'MapReduce', 'Big Data']}
{'Big Data': ['0', '8', '9'],
 'C++': ['5'],
 'Cassandra': ['0', '1'],
 'HBas': ['0'],
 'HBase': ['1'],
 'Hadoop': ['0', '9'],
 'Haskell': ['5'],
 'Java': ['0', '5', '9'],
 'Mahout': ['7'],
 'MapReduce': ['9'],
 'MongoDB': ['1'],
 'NoSQL': ['1'],
 'Postgres': ['1'],
 'Python': ['2', '3', '5'],
 'R': ['3', '5'],
 'Spark': ['0'],
 'Storm': ['0'],
 'artificial intelligence': ['8'],
 'decision trees': ['4'],
 'deep learning': ['8'],
 'libsvm': ['4'],
 'machine learning': ['4', '7'],
 'mathematics': ['6'],
 'neural networks': ['7', '8'],
 'numpy': ['2'],
 'pandas': ['2'],
 'probability': ['3', '6'],
 'programming languages': ['5'],
 'regression': ['3', '4'],
 'scikit-learn': ['7'],
 'scipy': ['2'],
 'skikit-learn': ['2'],
 'statistics': ['3', '6'],
 'statsmodels': ['2'],
 'theory': ['6']}

Answer 6

You can also do this with set and dict comprehensions.

data = [('0', 'Hadoop'),
('0', 'Big Data'),
('0', 'HBas'),
('0', 'Java'),
...]

ids = {id_[0] for id_ in data}
d = {id_: [intrest[1] for intrest in data if intrest[0] == id_] for id_ in ids}

The result is:

{'9': ['Hadoop', 'Java', 'MapReduce', 'Big Data'], '8': ['neural networks', 'deep learning', 'Big Data', 'artificial intelligence'], '6': ['statistics', 'probability', 'mathematics', 'theory'], '3': ['R', 'Python', 'statistics', 'regression', 'probability'], '2': ['Python', 'skikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas'], '5': ['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages'], '4': ['machine learning', 'regression', 'decision trees', 'libsvm'], '0': ['Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'], '1': ['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres'], '7': ['machine learning', 'scikit-learn', 'Mahout', 'neural networks']}

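Edit

It is more efficient to use itertools.groupby, which makes a single pass instead of rescanning the list for every ID. Note that groupby only groups consecutive items, so this relies on data already being ordered by ID, as it is here.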

from itertools import groupby
from operator import itemgetter

id_intrests = groupby(data, key=itemgetter(0))
d = {id_: [_[1] for _ in intrests] for id_, intrests in id_intrests}
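Because groupby only merges runs of consecutive items that share a key, it gives the expected grouping only when the input is already ordered by that key. The sketch below, on a small made-up sample, shows what happens when the same ID appears in two separate runs and how sorting first avoids it:

```python
from itertools import groupby
from operator import itemgetter

# Illustrative sample in which ID '0' appears in two separate runs.
sample = [('0', 'Hadoop'), ('1', 'NoSQL'), ('0', 'Java')]

# Without sorting, the later run for '0' overwrites the earlier one.
unsorted_result = {k: [v for _, v in grp]
                   for k, grp in groupby(sample, key=itemgetter(0))}

# sorted() is stable, so interests keep their original relative order.
ordered = sorted(sample, key=itemgetter(0))
sorted_result = {k: [v for _, v in grp]
                 for k, grp in groupby(ordered, key=itemgetter(0))}

print(unsorted_result)  # {'0': ['Java'], '1': ['NoSQL']}
print(sorted_result)    # {'0': ['Hadoop', 'Java'], '1': ['NoSQL']}
```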

Answer 7

Another approach is to use itertools.groupby:

import itertools

tups = [('0', 'Hadoop'),
('0', 'Big Data'),
('0', 'HBas'),
...]

{k:list(zip(*v))[1] for k, v in itertools.groupby(tups, key=lambda x:x[0])}

{'0': ('Hadoop', 'Big Data', 'HBas', 'Java', 'Spark', 'Storm', 'Cassandra'),
...
 '9': ('Hadoop', 'Java', 'MapReduce', 'Big Data')}

{k:list(zip(*v))[0] for k, v in itertools.groupby(sorted(tups, key=lambda x:x[1]), key=lambda x:x[1])}

{'Big Data': ('0', '8', '9'),
 ...
 'theory': ('6',)}
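The values produced above are tuples rather than the lists shown in the question. If lists are needed, one possible variation is to build them with a list comprehension over each group. A sketch on a shortened sample (the names by_id and by_interest are illustrative):

```python
import itertools

# Shortened, illustrative sample of the question's data.
tups = [('0', 'Hadoop'), ('0', 'Big Data'), ('9', 'Hadoop')]

# ID -> list of interests; tups is already ordered by ID.
by_id = {k: [interest for _, interest in grp]
         for k, grp in itertools.groupby(tups, key=lambda x: x[0])}

# Interest -> list of IDs; sort by interest first so equal keys are adjacent.
by_interest = {k: [emp_id for emp_id, _ in grp]
               for k, grp in itertools.groupby(sorted(tups, key=lambda x: x[1]),
                                               key=lambda x: x[1])}

print(by_id)
print(by_interest)
```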

Converting a list of tuples into a dictionary in two different ways
