I ran the following script:
from itertools import groupby
from pprint import pprint as prnt
dt = [('23271800', 0.00066790780636275307),
('23271812', 0.0010018617095441298),
('26112103', 0.00066790780636275307),
('27111616', 0.0056772163540834012),
# ... many lines deleted ...
('40161500', 0.00040074468381765189)
]
agg = groupby(dt, lambda x: x[0])
lst = list(agg)
lst1 = map(lambda x: (x[0], list(x[1])), lst)
prnt(lst1)
For the key '23271800', it should report [('23271800', 0.00066790780636275307)] as the corresponding group. However, the output I get is incorrect:
[('23271800', []),
('23271812', []),
('26112103', []),
('27111616', []),
# ... many lines deleted ...
('40161500', [('40161500', 0.00040074468381765189)])]
I need help understanding what I am doing wrong here.
PS: code paste: http://codepad.org/cCd8DfoT
Answer 0 (score: 7)
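The iterator returned by itertools.groupby() is a bit tricky to use. As the documentation says:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible.
This means you must process each group as it is produced by the groupby() object. If you look at a group later, you will find that its contents have already been skipped. In the question, list(agg) advances the groupby() object all the way to the end before any group is read, so the earlier groups come back empty. For example: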
>>> from itertools import groupby
>>> groups = list(groupby('AAABBBCCC'))
>>> groups
[('A', <itertools._grouper object at 0x107155490>),
('B', <itertools._grouper object at 0x1071553d0>),
('C', <itertools._grouper object at 0x107155d50>)]
>>> list(groups[0][1])
[]
The documentation goes on to say:
So, if that data is needed later, it should be stored as a list.
For example:
>>> groups = [(key, list(group)) for key, group in groupby('AAABBBCCC')]
>>> groups[0][1]
['A', 'A', 'A']
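But it is usually better to reorganize your code so that you can process each group in turn, without storing it in a list. For example, like this: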
for key, group in groupby('AAABBBCCC'):
    for item in group:
        # do something with item while the group is still current
        print(key, item)
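As a slightly more concrete sketch of processing each group in turn, the loop below sums the values for each key while the group is still current. The `records` list is made-up sample data in the same (code, value) shape as the question's `dt`, and the names `records` and `totals` are only illustrative.

```python
from itertools import groupby

# Hypothetical sample data shaped like the question's list of (code, value) pairs.
records = [
    ('30101600', 0.25),
    ('30101600', 0.5),
    ('31161900', 1.0),
    ('40161500', 2.0),
    ('40161500', 0.5),
]

totals = []
for key, group in groupby(records, key=lambda rec: rec[0]):
    # Consume the group here, before groupby() advances to the next key.
    totals.append((key, sum(value for _, value in group)))

print(totals)
```

Only the per-key totals are kept, so no group ever needs to be stored as a list.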
Answer 1 (score: 1)
d = [(key, list(group)) for key, group in groupby(dt, lambda x: x[0])]
prnt(d)
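For each group it finds, groupby yields the group's key and an iterator over the items in that group, so each group has to be turned into a list while it is still the current one.
Output: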
[('23271800', [('23271800', 0.0006679078063627531)]),
('23271812', [('23271812', 0.0010018617095441298)]),
('26112103', [('26112103', 0.0006679078063627531)]),
('27111616', [('27111616', 0.005677216354083401)]),
('30101600',
[('30101600', 1.3909064158636346e-05), ('30101600', 0.002002905238843634)]),
('30102200', [('30102200', 0.00013358156127255062)]),
('31100000', [('31100000', 2.1849453575689805e-05)]),
('31161500', [('31161500', 0.0005180729752775727)]),
('31161501', [('31161501', 0.00012902764441098641)]),
('31161505', [('31161505', 0.013866049271881438)]),
('31161513', [('31161513', 0.021559049445886335)]),
('31161518', [('31161518', 0.0011596016382808651)]),
('31161520', [('31161520', 0.022263593545425106)]),
('31161600', [('31161600', 0.003930380552826971)]),
('31161618', [('31161618', 0.0016029787352706075)]),
('31161620', [('31161620', 0.0008462931211056002)]),
('31161700', [('31161700', 0.0008833842874611101)]),
('31161716', [('31161716', 7.067074299688881e-05)]),
('31161717', [('31161717', 0.0014193040885208503)]),
('31161727', [('31161727', 0.01364664212812536)]),
('31161801', [('31161801', 0.000179280516444739)]),
('31161900',
[('31161900', 1.6624352427769844e-05), ('31161900', 0.0001496191718499286)]),
('31161904', [('31161904', 6.666007460763289e-05)]),
('31162409', [('31162409', 0.007129527514430318)]),
('31162800',
[('31162800', 0.0002625302360269781),
('31162800', 0.359403893120933),
('31162800', 0.2207879284986886),
('31162800', 0.0002625302360269781)]),
('31163200',
[('31163200', 0.00037295581888139136),
('31163200', 4.1439535431265705e-05)]),
('31163201', [('31163201', 0.011292216638533014)]),
('31163202',
[('31163202', 4.5417730832667214e-05),
('31163202', 4.5417730832667214e-05)]),
('31163203', [('31163203', 0.003471418917146539)]),
('31163204', [('31163204', 0.0002962025923869601)]),
('31163214', [('31163214', 0.0014119501813264418)]),
('31163215', [('31163215', 0.017772155543217604)]),
('31171504', [('31171504', 0.05423235622453355)]),
('31181600', [('31181600', 5.262772981769086e-05)]),
('31181602', [('31181602', 0.00019920057382748777)]),
('31191518', [('31191518', 0.0014972878296483697)]),
('39121719', [('39121719', 0.0022708865416333607)]),
('40141600', [('40141600', 5.0614113112184855e-05)]),
('40141607', [('40141607', 0.0958030259751574)]),
('40141616',
[('40141616', 0.005499007768977646), ('40141616', 0.00015275021580493458)]),
('40141636', [('40141636', 0.0007247510239255406)]),
('40141680', [('40141680', 0.12267031972561518)]),
('40142000', [('40142000', 0.0002962025923869601)]),
('40142100', [('40142100', 8.188292818389522e-05)]),
('40142315', [('40142315', 0.00034758467473980007)]),
('40142323', [('40142323', 0.0006308018171203779)]),
('40161500', [('40161500', 0.0004007446838176519)])]
Answer 2 (score: 1)
It sounds like you want a non-streaming groupby operation. The implementation in itertools may be overkill for your application. You could try the implementation in toolz.
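To illustrate what a non-streaming groupby means, here is a minimal standard-library sketch. It is not the toolz implementation: `group_all` is a hypothetical helper name, and the sample `records` are made up. As far as I know, `toolz.groupby(key, seq)` returns a similar dict of lists, but check its documentation before relying on that.

```python
from collections import defaultdict

def group_all(key, seq):
    """Eagerly collect every item of seq under key(item); returns a dict of lists."""
    groups = defaultdict(list)
    for item in seq:
        groups[key(item)].append(item)
    return dict(groups)

# Hypothetical sample data; note that the equal keys are not adjacent.
records = [('30101600', 0.25), ('40161500', 2.0), ('30101600', 0.5)]
grouped = group_all(lambda rec: rec[0], records)
print(grouped['30101600'])
```

Because everything is collected up front, the groups can be read in any order and as often as needed, and the input does not have to be sorted by key. The cost is that all items are held in memory at once.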