Calling roundrobin on the result of itertools.groupby()

Time: 2018-11-18 01:13:24

Tags: python itertools

I'm looking for a more efficient and Pythonic way to apply the roundrobin recipe from the itertools documentation to the groups produced by itertools.groupby().

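Specifically, I have an (unsorted) list of URLs and want to reorder it so that the result puts the maximum "distance" (or perhaps diversification) between occurrences of each unique netloc (host), as defined by that attribute in urllib.parse. A reproducible example is below.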
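For readers unfamiliar with the attribute, here is a minimal illustration of what netloc means for one of the sample URLs. It only shows the grouping key; it is not part of the reordering itself.

```python
from urllib.parse import urlsplit

# Split one of the sample URLs into its components.
parts = urlsplit('https://docs.scipy.org/3')

print(parts.netloc)  # docs.scipy.org
print(parts.path)    # /3
```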

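I'm currently using itertools.groupby() together with the roundrobin recipe, but because of the nature of groupby(),

"The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list."

...this seems to make it necessary to build an intermediate list for each group.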
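The behavior described in that quote can be seen on a small input. This is only an illustrative sketch with made-up data (`data`, `lost`, and `kept` are names chosen here, not from the question). What the stale groups still yield has varied between Python versions, so only the correctly copied result is annotated.

```python
import itertools as it

data = ['a1', 'a2', 'b1', 'b2', 'b3']

# Exhaust the outer groupby() first, holding on to the lazy group iterators.
lazy_groups = [(key, group) for key, group in it.groupby(data, key=lambda s: s[0])]
# By now the shared source has been consumed, so the groups are stale.
lost = [list(group) for _, group in lazy_groups]

# Copy each group into a list before groupby() advances to the next one.
kept = [list(group) for _, group in it.groupby(data, key=lambda s: s[0])]

print(lost)  # items are missing here
print(kept)  # [['a1', 'a2'], ['b1', 'b2', 'b3']]
```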

Sample data:

import itertools as it
import urllib.parse
from pprint import pprint  # needed for the pprint() call below

bases = ('https://www.google.com', 'https://www.youtube.com',
         'https://docs.scipy.org', 'https://www.group.me')
urls = []
counts = (1, 5, 10, 15)
for c, b in zip(counts, bases):
    for i in range(c):
        urls.append(f'{b}/{i}')

pprint(urls)
# ['https://www.google.com/0',
#  'https://www.youtube.com/0',
#  'https://www.youtube.com/1',
#  'https://www.youtube.com/2',
#  'https://www.youtube.com/3',
#  'https://www.youtube.com/4',
#  'https://docs.scipy.org/0',
#  'https://docs.scipy.org/1',
#  'https://docs.scipy.org/2',
#  'https://docs.scipy.org/3',
#  'https://docs.scipy.org/4',
#  'https://docs.scipy.org/5',
#  'https://docs.scipy.org/6',
#  'https://docs.scipy.org/7',
#  'https://docs.scipy.org/8',
#  'https://docs.scipy.org/9',
#  'https://www.group.me/0',
#  'https://www.group.me/1',
#  'https://www.group.me/2',
#  'https://www.group.me/3',
#  'https://www.group.me/4',
#  'https://www.group.me/5',
#  'https://www.group.me/6',
#  'https://www.group.me/7',
#  'https://www.group.me/8',
#  'https://www.group.me/9',
#  'https://www.group.me/10',
#  'https://www.group.me/11',
#  'https://www.group.me/12',
#  'https://www.group.me/13',
#  'https://www.group.me/14']

Current solution (take one item from each group in turn, skipping a group once it is empty, until every group has raised StopIteration):

grp = it.groupby(sorted(urls), key=lambda u: urllib.parse.urlsplit(u).netloc)
shuffled = list(roundrobin(*(list(g) for _, g in grp)))
#                            ^^ Each group is otherwise lost because
#                               groupby() itself is an iterator
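For reference, roundrobin cannot be imported from itertools; it is one of the recipes listed in the itertools documentation. The sketch below is adapted from that recipe as it appeared around the time of this question, with variables renamed so they do not shadow the built-in next or the `it` alias used above. The `groups` data is made up for illustration.

```python
from itertools import cycle, islice

def roundrobin(*iterables):
    """roundrobin('ABC', 'D', 'EF') --> A D E B F C"""
    num_active = len(iterables)
    # One bound __next__ per input, visited in a repeating cycle.
    nexts = cycle(iter(iterable).__next__ for iterable in iterables)
    while num_active:
        try:
            for advance in nexts:
                yield advance()
        except StopIteration:
            # Drop the iterator that was just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

groups = [['a0', 'a1', 'a2'], ['b0'], ['c0', 'c1']]
print(list(roundrobin(*groups)))  # ['a0', 'b0', 'c0', 'a1', 'c1', 'a2']
```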

The expected output for this sample is:

['https://docs.scipy.org/0',
 'https://www.google.com/0',
 'https://www.group.me/0',
 'https://www.youtube.com/0',
 'https://docs.scipy.org/1',
 'https://www.group.me/1',
 'https://www.youtube.com/1',
 'https://docs.scipy.org/2',
 'https://www.group.me/10',
 'https://www.youtube.com/2',
 'https://docs.scipy.org/3',
 'https://www.group.me/11',
 'https://www.youtube.com/3',
 'https://docs.scipy.org/4',
 'https://www.group.me/12',
 'https://www.youtube.com/4',
 'https://docs.scipy.org/5',
 'https://www.group.me/13',
 'https://docs.scipy.org/6',
 'https://www.group.me/14',
 'https://docs.scipy.org/7',
 'https://www.group.me/2',
 'https://docs.scipy.org/8',
 'https://www.group.me/3',
 'https://docs.scipy.org/9',
 'https://www.group.me/4',
 'https://www.group.me/5',
 'https://www.group.me/6',
 'https://www.group.me/7',
 'https://www.group.me/8',
 'https://www.group.me/9']

What would be a more efficient way to do this?

1 answer:

Answer 0: (score: 2)

It's not a big improvement, but you can get the same result from itertools.zip_longest with a small tweak:

shuffled = [x for row in it.zip_longest(*(list(g) for _, g in grp)) for x in row if x is not None]
# flatten the padded tuples and keep only the non-None values
# (testing "is not None" rather than truthiness so falsy items are not dropped)

The benefit is that you don't have to define the roundrobin recipe. The time saved is negligible, though (timed with n=10000):

# 3.7466756048055094 # zip_longest
# 4.077965201903506  # roundrobin
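One caveat with the zip_longest approach is that its default fill value is None, so a group that legitimately contains None could not be told apart from padding. A hedged variation, assuming a private sentinel (the names `_MISSING` and `interleave` are hypothetical, chosen for this sketch), avoids that:

```python
import itertools as it

_MISSING = object()  # hypothetical sentinel; never equal to real data

def interleave(*iterables):
    """Yield items round-robin style, skipping zip_longest padding."""
    for row in it.zip_longest(*iterables, fillvalue=_MISSING):
        for item in row:
            if item is not _MISSING:
                yield item

groups = [['a', 'b', 'c'], [0, None], ['x']]
print(list(interleave(*groups)))  # ['a', 0, 'x', 'b', None, 'c']
```

For the URL data in the question this produces the same order as the one-liner above; the sentinel only matters when items can be None or otherwise falsy.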

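I have a feeling there is another solution that uses sort(key=...) on a collections.Counter, or sorted(list), but I haven't worked it out yet. I suspect it could end up slower than your implementation, since it would probably rely more on Python-level code than on compiled modules. It's an interesting problem, though, and I may come back to it later.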
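One possible shape for such a sort-based approach, offered only as a sketch and not as the answerer's worked-out solution: tag each URL with its rank within its host, then sort by (rank, host). The function name `spread_by_host` and the example.com URLs are assumptions made for illustration. No claim is made that this is faster than the groupby() version, since it still sorts twice and runs its loop in Python.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def spread_by_host(urls):
    """Order URLs by (rank within host, host), mimicking roundrobin over sorted groups."""
    seen = defaultdict(int)  # host -> number of URLs already ranked for it
    keyed = []
    for url in sorted(urls):
        host = urlsplit(url).netloc
        keyed.append((seen[host], host, url))
        seen[host] += 1
    # (rank, host) pairs are unique, so the URL itself never decides the order.
    keyed.sort()
    return [url for _, _, url in keyed]

sample = ['https://c.example.com/0', 'https://c.example.com/1',
          'https://c.example.com/2', 'https://b.example.com/0',
          'https://b.example.com/1', 'https://a.example.com/0']
print(spread_by_host(sample))
# ['https://a.example.com/0', 'https://b.example.com/0', 'https://c.example.com/0',
#  'https://b.example.com/1', 'https://c.example.com/1', 'https://c.example.com/2']
```

This avoids groupby() and the per-group lists, at the cost of building one list of tuples for the whole input.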