Question

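I have a list containing email addresses. I need to count how many times each address occurs and present the result as a list like the one below (Google Analytics format):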

[["to99@example.com", "260"], ["to54@example.com", "4"], ["to30@example.com", "3"],
["to16@example.com", "2"], ["to77@example.com", "2"], ["to78@example.com", "2"],
["to76@example.com", "1"], ["to32@example.com", "1"], ["to24@example.com", "1"]]

(It will not contain more than 100 records.)

This is how I did it:

# count number of emails
addressees = {}
for i in emails:
    if i.to in addressees: addressees[i.to] += 1
    else: addressees[i.to] = 1

As a result I get a dictionary like this:

{u'to23@example.com': 2, u'to50@example.com': 3, u'to77@example.com': 6, 
 u'to99@example.com': 102, u'to72@example.com': 1, u'to46@example.com': 1,
 u'to33@example.com': 1, u'to78@example.com': 1, u'to56@example.com': 1,
 u'to54@example.com': 2}

I then convert it to the list format I need:

addressees_list = []
for addr in iter(addressees):
    addressees_list.append([addr, str(addressees[addr])])

It looks awful. Is there a way to build the list from the start? I also need to sort the final list by the count value.

Answer 1

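No, not unless you sort the email addresses first, and that would be an O(N log N) solution to a problem where using a mapping gives you an O(N) approach.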

There are more Pythonic ways to produce your output, though:

from collections import Counter

counts = Counter(i.to for i in emails)
addressees_list = [[addr, str(count)] for addr, count in counts.most_common()]

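The collections.Counter() class lets you collect those counts in a single line of code (extracting the .to attribute of each email object), and the Counter.most_common() method produces the required output in sorted order.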
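To try the snippet above on its own, here is a minimal sketch. The Email namedtuple and the sample addresses are hypothetical stand-ins for whatever objects the emails list really holds; the only assumption is that each one exposes a .to attribute.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the real email objects; only .to is needed.
Email = namedtuple("Email", ["to"])

emails = [
    Email("to99@example.com"),
    Email("to54@example.com"),
    Email("to99@example.com"),
    Email("to99@example.com"),
    Email("to54@example.com"),
    Email("to30@example.com"),
]

# One pass over the emails builds the address -> count mapping.
counts = Counter(e.to for e in emails)

# most_common() yields (address, count) pairs, highest count first.
addressees_list = [[addr, str(count)] for addr, count in counts.most_common()]
print(addressees_list)
# [['to99@example.com', '3'], ['to54@example.com', '2'], ['to30@example.com', '1']]
```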

Demo, using a dataset derived from your desired output:

>>> # expand email counts into a sequence of emails matching those counts
...
>>> from random import shuffle
>>> dataset = [["to99@example.com", "260"], ["to54@example.com", "4"], ["to30@example.com", "3"],
... ["to16@example.com", "2"], ["to77@example.com", "2"], ["to78@example.com", "2"],
... ["to76@example.com", "1"], ["to32@example.com", "1"], ["to24@example.com", "1"]]
>>> dataset = [e for e, count in dataset for _ in range(int(count))]
>>> shuffle(dataset)
>>> # actual counting
... 
>>> from collections import Counter
>>> counts = Counter(dataset)
>>> [[addr, str(count)] for addr, count in counts.most_common()]
[['to99@example.com', '260'], ['to54@example.com', '4'], ['to30@example.com', '3'], ['to78@example.com', '2'], ['to77@example.com', '2'], ['to16@example.com', '2'], ['to24@example.com', '1'], ['to32@example.com', '1'], ['to76@example.com', '1']]

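The inefficient approach, written as a one-liner, requires you to sort twice: once to group the email addresses so they can be counted inline, and a second time to sort the resulting counted addresses. Counter.most_common() uses sorting too, but it sorts only once, after counting.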

As a one-liner, that would be:

from itertools import groupby
from operator import itemgetter

[(e, str(c))
 for e, c in sorted(([email, sum(1 for _ in group)]
                     for email, group in groupby(sorted(i.to for i in emails))),
                     key=itemgetter(1), reverse=True)]

Apart from the inefficiency of the approach, it really does look awful.
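For comparison, the single sort that Counter.most_common() relies on can be written out by hand. This is only a sketch of the idea with made-up sample data, not a replacement for the method: count in one pass, then sort the (address, count) pairs by count once.

```python
from collections import Counter
from operator import itemgetter

# Made-up sample data for illustration.
addresses = ["a@example.com", "b@example.com", "a@example.com",
             "c@example.com", "a@example.com", "b@example.com"]

# Single O(N) counting pass over the addresses.
counts = Counter(addresses)

# One sort over the distinct addresses, keyed on the count, descending.
by_count = sorted(counts.items(), key=itemgetter(1), reverse=True)

result = [[addr, str(count)] for addr, count in by_count]
print(result)
# [['a@example.com', '3'], ['b@example.com', '2'], ['c@example.com', '1']]
```

The sort here runs over the distinct addresses only, so it never handles more items than the original list contains.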

Answer 2

Perhaps I have misunderstood your question, but you could try this:

from collections import Counter
addresses = ['to99@example.com', 'to54@example.com', 'to54@example.com', ]

[(k, v) for k, v in Counter(addresses).most_common()]

Output: [('to54@example.com', 2), ('to99@example.com', 1)]
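If the exact format from the question is needed (inner lists with the count as a string), the comprehension can be adjusted slightly. A minimal sketch, reusing the same sample addresses:

```python
from collections import Counter

addresses = ['to99@example.com', 'to54@example.com', 'to54@example.com']

# Same counting step, but emit [address, "count"] lists instead of tuples.
result = [[addr, str(count)] for addr, count in Counter(addresses).most_common()]
print(result)
# [['to54@example.com', '2'], ['to99@example.com', '1']]
```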
