Create a Python generator that yields the ordered products of integers from two large lists

Time: 2018-07-30 18:43:23

Tags: python python-3.x iterator generator

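So, I have two very large lists of numbers, l1 and l2. I'd like to multiply each element of l1 by each element of l2 without explicitly creating a new list of products. Thus, I would like a generator. This part is easy. I can just do something like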

for a in l1:
    for b in l2:
        yield a * b

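However, I also need these products to be ordered by magnitude. I am wondering if there is some clever trick to order the yield statements so that this can also be done with a generator. In Python 3, if possible. Thanks.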

3 answers:

Answer 0: (score: 7)

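I'll call the lists xs and ys, and assume they're sorted. As you noted in a comment, the smallest product is necessarily xs[0] * ys[0], but only if you also assume that all numbers are non-negative, so I'll assume that too.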
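
To see why the non-negativity assumption matters, here is a minimal sketch (the sample lists are made up for illustration): once negative numbers are allowed, xs[0] * ys[0] need not be the smallest product.

```python
# Hypothetical sample data: sorted ascending, but containing negatives.
xs = [-3, 1]
ys = [-2, 4]

products = sorted(x * y for x in xs for y in ys)
first_corner = xs[0] * ys[0]

print(products)      # [-12, -2, 4, 6]
print(first_corner)  # 6, which is the largest product here, not the smallest
```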

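After the first product it gets messier, otherwise you would already have solved it ;-) The next two to consider are xs[0] * ys[1] and xs[1] * ys[0]. Easy enough, but what to consider after that depends on which of those won. If xs[0] * ys[1] won, you only need to replace it with xs[0] * ys[2], but if xs[1] * ys[0] won, then both xs[1] * ys[1] and xs[2] * ys[0] come into play. And so on.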

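The following keeps track of the growing set of possibilities with a heap. The heap never holds more than len(xs) items, so the code first arranges to make xs the shorter list: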

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 4-tuples:
    #     (product, xs index, ys index, xs[xs index])
    h = [(xs[0] * ys[0], 0, 0, xs[0])]
    while h:
        prod, xi, yi, x = heappop(h)
        yield prod
        # same x with next y
        yi += 1
        if yi < lenys:
            heappush(h, (x * ys[yi], xi, yi, x))
        # if this is the first time we used x, start
        # the next x going
        if yi == 1:
            xi += 1
            if xi < lenxs:
                x = xs[xi]
                heappush(h, (x * ys[0], xi, 0, x))
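
As a quick illustration of the order in which index pairs leave the heap, here is a small variant of the code above. The name upprod_with_indices and the sample lists are my own, not part of the answer; it yields each product together with the index pair that produced it.

```python
from heapq import heappush, heappop

def upprod_with_indices(xs, ys):
    """Like upprod, but yields (product, xs index, ys index).

    Assumes xs and ys are sorted and non-negative. Unlike upprod, the
    lists are not swapped, so the indices refer to the arguments as given.
    """
    if not xs or not ys:
        return
    h = [(xs[0] * ys[0], 0, 0)]
    while h:
        prod, xi, yi = heappop(h)
        yield prod, xi, yi
        # same x with next y
        yi += 1
        if yi < len(ys):
            heappush(h, (xs[xi] * ys[yi], xi, yi))
        # first time this x was used: start the next x going
        if yi == 1 and xi + 1 < len(xs):
            heappush(h, (xs[xi + 1] * ys[0], xi + 1, 0))

for item in upprod_with_indices([1, 3], [2, 5]):
    print(item)
# (2, 0, 0)
# (5, 0, 1)
# (6, 1, 0)
# (15, 1, 1)
```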

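I'd be pleasantly surprised if an essentially more efficient solution exists. If someone thinks they have one, please try it first with this randomized tester: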

from itertools import product
from random import randrange
MAXLEN = 10
UB = 1000
ntest = 0
while True:
    ntest += 1
    lenxs = randrange(MAXLEN + 1)
    lenys = randrange(MAXLEN + 1)
    xs = sorted(randrange(UB) for i in range(lenxs))
    ys = sorted(randrange(UB) for i in range(lenys))
    brute = sorted(a*b for a, b in product(xs, ys))
    got = list(upprod(xs, ys))
    if brute != got:
        print("OUCH!")
        print(xs)
        print(ys)
        print(brute)
        print(got)
        assert False
    if ntest % 10_000 == 0:
        print(f"finished test {ntest:,}")

EDIT - theoretically better in some sense ;-)

The above doesn't fully exploit the partial ordering we can deduce from the indices alone: if

i1 <= i2 and j1 <= j2

then we know

xs[i1] * ys[j1] <= xs[i2] * ys[j2]

because sorting implies xs[i1] <= xs[i2] and ys[j1] <= ys[j2].
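
A brute-force check of this claim can be sketched as follows. The helper name index_order_holds is hypothetical; it tests every comparable pair of index pairs, and it also shows that the claim can fail once negative numbers are present.

```python
from itertools import product

def index_order_holds(xs, ys):
    """True if i1 <= i2 and j1 <= j2 always implies
    xs[i1] * ys[j1] <= xs[i2] * ys[j2]."""
    pairs = list(product(range(len(xs)), range(len(ys))))
    return all(
        xs[i1] * ys[j1] <= xs[i2] * ys[j2]
        for i1, j1 in pairs
        for i2, j2 in pairs
        if i1 <= i2 and j1 <= j2
    )

print(index_order_holds([0, 2, 5], [1, 1, 4, 9]))  # True
print(index_order_holds([-3, 1], [-2, 4]))         # False: 6 > -12
```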

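So, for example, if the index pairs (0, 1) and (1, 0) are on the heap and the second one wins, (2, 0) needs to be added to the heap but (1, 1) doesn't: from the indices alone, we know that the product from the (0, 1) remaining in the heap is no larger than the product from (1, 1). Only when (0, 1) has also been removed does (1, 1) need to be added.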

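In general, every pair of the form (i, 0) has a single immediate predecessor (i-1, 0), every pair (0, j) has the single immediate predecessor (0, j-1), and all other (i, j) have two immediate predecessors: (i-1, j) and (i, j-1). There's no need to put a pair on the heap until all of its predecessors have been taken off the heap.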
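
The predecessor rule can be written down directly. This is only an illustrative helper (the name immediate_predecessors is mine, not from the answer):

```python
def immediate_predecessors(i, j):
    """Index pairs that must leave the heap before (i, j) is pushed."""
    preds = []
    if i > 0:
        preds.append((i - 1, j))
    if j > 0:
        preds.append((i, j - 1))
    return preds

print(immediate_predecessors(0, 0))  # []
print(immediate_predecessors(3, 0))  # [(2, 0)]
print(immediate_predecessors(0, 2))  # [(0, 1)]
print(immediate_predecessors(2, 3))  # [(1, 3), (2, 2)]
```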

Which leads to this code, which seems "more elegant" because it is more symmetric:

def upprod(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 3-tuples:
    #     (product, xs index, ys index)
    h = [(xs[0] * ys[0], 0, 0)]

    # interior points for which only one immediate predecessor has
    # been processed; there's no need to put them in the heap
    # until their second predecessor has been processed too
    pending = set()

    def add(xi, yi):
        if xi < lenxs and yi < lenys:
            if xi and yi: # if either is 0, only one predecessor
                p = xi, yi
                if p in pending:
                    pending.remove(p)
                else:
                    pending.add(p)
                    return
            heappush(h, (xs[xi] * ys[yi], xi, yi))

    while h:
        prod, xi, yi = heappop(h)
        yield prod
        # same x with next y; and same y with next x
        add(xi, yi + 1)
        add(xi + 1, yi)
    assert not pending

Compared to the first code, it keeps the heap somewhat smaller in many cases. But heap operations take time logarithmic in the number of heap entries, and the heap can still grow to len(xs) entries, so it's not much of a win. It's probably lost to the overhead of the two new function calls (although inlining those is too ugly to bear).
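
To get a feel for how the two approaches differ in heap size, one could instrument them along these lines. The functions peak_heap_first and peak_heap_second are my own stripped-down sketches, not the answer's code: they only track the largest heap size reached and discard the products.

```python
from heapq import heappush, heappop
from random import randrange, seed

def peak_heap_first(xs, ys):
    """Peak heap size of the first approach (one live entry per x)."""
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return 0
    h = [(xs[0] * ys[0], 0, 0)]
    peak = 1
    while h:
        _, xi, yi = heappop(h)
        yi += 1
        if yi < len(ys):
            heappush(h, (xs[xi] * ys[yi], xi, yi))
        if yi == 1 and xi + 1 < len(xs):
            heappush(h, (xs[xi + 1] * ys[0], xi + 1, 0))
        peak = max(peak, len(h))
    return peak

def peak_heap_second(xs, ys):
    """Peak heap size of the second approach (wait for both predecessors)."""
    if not xs or not ys:
        return 0
    h = [(xs[0] * ys[0], 0, 0)]
    pending = set()  # interior pairs seen from only one predecessor so far
    peak = 1
    while h:
        _, xi, yi = heappop(h)
        for p in ((xi, yi + 1), (xi + 1, yi)):
            i, j = p
            if i >= len(xs) or j >= len(ys):
                continue
            if i and j:
                if p not in pending:
                    pending.add(p)
                    continue
                pending.remove(p)
            heappush(h, (xs[i] * ys[j], i, j))
        peak = max(peak, len(h))
    return peak

seed(0)  # arbitrary seed, only to make the run repeatable
xs = sorted(randrange(1000) for _ in range(50))
ys = sorted(randrange(1000) for _ in range(80))
print(peak_heap_first(xs, ys), peak_heap_second(xs, ys))
```

Both numbers are bounded by the length of the shorter list; how far apart they are depends on the data.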

Answer 1: (score: 4)

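My solution is to create a list of generators, one generator for each row of the product matrix, and then to use heapq.merge to sort the outputs of those generators. Each generator has a size of 44 bytes on a 32 bit machine, so the whole list of generators only consumes a modest amount of RAM.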

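heapq.merge (when no sorting key function is supplied) works by creating a 3-tuple for each of the iterables you pass it. That tuple contains the next value from the iterable, an index number for the iterable, and a reference to the iterable's __next__ method. It places those tuples onto a heap to perform a mergesort of the iterables' values. You can see the details in its Python source code.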
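
For readers who haven't used heapq.merge, here is a small standalone sketch of its behaviour (the sample data is made up). It merges already-sorted inputs and does so lazily, which is what makes it suitable here.

```python
from heapq import merge
from itertools import count, islice

# Three already-sorted "rows", e.g. multiples of 2, 3 and 5.
rows = [[2, 4, 6], [3, 6, 9], [5, 10, 15]]
print(list(merge(*rows)))  # [2, 3, 4, 5, 6, 6, 9, 10, 15]

# merge is lazy: it works even on endless sorted iterators,
# as long as we only take finitely many items from the result.
evens = count(0, 2)
odds = count(1, 2)
print(list(islice(merge(evens, odds), 6)))  # [0, 1, 2, 3, 4, 5]
```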

So my approach isn't quite as frugal as Tim Peters' solution, but it's not too shabby, IMHO. ;)

from heapq import merge

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows. '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])
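
Because the result is a generator, the caller can stop early. The sketch below repeats the function so that it runs on its own and uses itertools.islice to take only the five smallest products of two 1000-element lists; the sample lists are my own.

```python
from heapq import merge
from itertools import islice

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows (repeated from above). '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])

xs = list(range(1, 1001))
ys = list(range(1, 1001))

# Only a handful of the 1,000,000 products are ever computed.
smallest = list(islice(sorted_prod_merge(xs, ys), 5))
print(smallest)  # [1, 2, 2, 3, 3]
```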

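Here's some timeit code which shows the running times of sorted_prod_merge, Tim Peters' solutions, and a few other versions of mine. I've used Tim's variable names to keep the code uniform. Interestingly, Tim's first version is about twice as fast as his more advanced solution. My sorted_prod_row runs pretty fast, but it's a terrible RAM hog.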

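The timeit code uses a technique given in the itertools recipes to exhaust an iterator: we feed it to a zero-length deque. The time_test code sorts the 3 results from each Timer run. That's because the minimum result is the important one; the other values merely indicate the variance in the system at the time the test was run. See the note under Timer.repeat in the docs for details.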
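
In isolation, the two techniques look roughly like this. The helper name consume and the squares generator are only for illustration; the repeat and number values are arbitrary.

```python
from collections import deque
from timeit import Timer

def consume(iterator):
    """Run an iterator to exhaustion without storing its items."""
    deque(iterator, maxlen=0)

squares = (i * i for i in range(1000))
consume(squares)
print(next(squares, 'exhausted'))  # exhausted

# Time the same work three times and keep the sorted results;
# the first (minimum) value is the one to compare across functions.
timer = Timer(lambda: consume(i * i for i in range(1000)))
timings = sorted(timer.repeat(repeat=3, number=100))
best = timings[0]
print('best of 3: {:.6f}'.format(best))
```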

from heapq import heappush, heappop, merge
from random import seed, randrange
from timeit import Timer
from collections import deque

seed(163)

# Brute force method, as a generator
def sorted_prod_brute(xs, ys):
    yield from sorted(x * y for x in xs for y in ys)

# By Tim Peters
def upprod1(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 4-tuples:
    #     (product, xs index, ys index, xs[xs index])
    h = [(xs[0] * ys[0], 0, 0, xs[0])]
    while h:
        prod, xi, yi, x = heappop(h)
        yield prod
        # same x with next y
        yi += 1
        if yi < lenys:
            heappush(h, (x * ys[yi], xi, yi, x))
        # if this is the first time we used x, start
        # the next x going
        if yi == 1:
            xi += 1
            if xi < lenxs:
                x = xs[xi]
                heappush(h, (x * ys[0], xi, 0, x))

# By Tim Peters
def upprod2(xs, ys):
    # xs and ys must be sorted, and non-negative
    from heapq import heappush, heappop
    # make xs the shorter
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    lenxs = len(xs)
    lenys = len(ys)
    # the heap holds 3-tuples:
    #     (product, xs index, ys index)
    h = [(xs[0] * ys[0], 0, 0)]

    # interior points for which only one immediate predecessor has
    # been processed; there's no need to put them in the heap
    # until their second predecessor has been processed too
    pending = set()

    def add(xi, yi):
        if xi < lenxs and yi < lenys:
            doit = True
            if xi and yi: # if either is 0, only one predecessor
                p = xi, yi
                if p in pending:
                    pending.remove(p)
                else:
                    pending.add(p)
                    doit = False
            if doit:
                heappush(h, (xs[xi] * ys[yi], xi, yi))
    while h:
        prod, xi, yi = heappop(h)
        yield prod
        # same x with next y; and same y with next x
        add(xi, yi + 1)
        add(xi + 1, yi)
    assert not pending

def sorted_prod_merge(xs, ys):
    ''' mergesort generators of the rows. '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    def gen(x):
        for y in ys:
            yield x * y
    yield from merge(*[gen(x) for x in xs])

def sorted_prod_row(xs, ys):
    ''' Heapsort, row by row.
        Fast, but not space-efficient: the maximum 
        heap size grows to almost len(ys) * len(xs)
    '''
    if len(ys) < len(xs):
        xs, ys = ys, xs
    if not xs:
        return
    x, xs = xs[0], xs[1:]
    heap = []
    #big = 0
    for y in ys:
        lo = x * y
        while heap and heap[0] <= lo:
            yield heappop(heap)
        yield lo
        for u in xs:
            heappush(heap, u * y)
        #big = max(big, len(heap))
    #print(big)
    while heap:
        yield heappop(heap)

def sorted_prod_diag(xs, ys):
    ''' Heapsort, going along the diagonals
        50% slower than sorted_prod_row, but more
        space-efficient: the maximum heap size 
        grows to around 0.5 * len(ys) * len(xs)
    '''
    if not (xs and ys):
        return
    lenxs, lenys = len(xs), len(ys)
    heap = []
    #big = 0
    for n in range(lenxs + lenys - 1):
        row = sorted(xs[n - i] * ys[i]
            for i in range(max(0, n + 1 - lenxs), min(lenys, n + 1)))
        lo = row[0]
        while heap and heap[0] <= lo:
            yield heappop(heap)
        yield lo
        for u in row[1:]:
            heappush(heap, u)
        #big = max(big, len(heap))
    #print(big)
    #assert not heap

def sorted_prod_block(xs, ys):
    ''' yield the top left corner, then merge sort
        the top row, the left column and the remaining 
        block. So we end up with max(len(xs), len(ys))
        recursively nested calls to merge(). It's ok
        for small lists, but too slow otherwise.
    '''
    if not (xs and ys):
        return
    x, *xs = xs
    y, *ys = ys
    yield x * y
    row = (y * u for u in xs)
    col = (x * v for v in ys)
    yield from merge(row, col, sorted_prod_block(xs, ys))

def sorted_prod_blockI(xs, ys):
    ''' Similar to sorted_prod_block except we use indexing
        to avoid creating sliced copies of the lists
    '''
    lenxs, lenys = len(xs), len(ys)
    def sorted_block(xi, yi):
        if xi == lenxs or yi == lenys:
            return
        x, y = xs[xi], ys[yi]
        yield x * y
        xi, yi = xi + 1, yi + 1
        row = (xs[i] * y for i in range(xi, lenxs))
        col = (ys[i] * x for i in range(yi, lenys))
        yield from merge(row, col, sorted_block(xi, yi))
    yield from sorted_block(0, 0)

functions = (
    sorted_prod_brute,
    upprod1,
    upprod2,
    sorted_prod_merge,
    #sorted_prod_row,
    sorted_prod_diag,
    #sorted_prod_block,
    #sorted_prod_blockI,
)

UB = 1000

def verify(numtests, maxlen=10):
    print('Verifying. maxlen =', maxlen)
    for k in range(numtests):
        lenxs = randrange(maxlen + 1)
        lenys = randrange(maxlen + 1)
        print(k, ':', lenxs, '*', lenys, '=', lenxs * lenys)
        xs = sorted(randrange(UB) for i in range(lenxs))
        ys = sorted(randrange(UB) for i in range(lenys))
        good = list(sorted_prod_brute(xs, ys))

        for func in functions[1:]:
            result = list(func(xs, ys))
            if result != good:
                print(func.__name__, 'failed!')
    print()

def time_test(loops=20):
    timings = []
    for func in functions:
        # Consume the generator output by feeding it to a zero-length deque
        t = Timer(lambda: deque(func(xs, ys), maxlen=0))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:18} : {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

verify(10, 10)
verify(20, 100)

print('\nTimings')
loops = 8192
minlen = 5
for k in range(6):
    lenxs = randrange(minlen, 2 * minlen)
    lenys = randrange(minlen, 2 * minlen)
    print(k, ':', loops, 'loops.', lenxs, '*', lenys, '=', lenxs * lenys)
    xs = sorted(randrange(UB) for i in range(lenxs))
    ys = sorted(randrange(UB) for i in range(lenys))
    time_test(loops)
    minlen *= 2
    loops //= 4

Here's the output on my ancient 2GHz 32 bit single core machine, running Python 3.6.0 on an old Debian derivative of Linux. YMMV.

Verifying. maxlen = 10
0 : 8 * 9 = 72
1 : 9 * 0 = 0
2 : 1 * 7 = 7
3 : 8 * 10 = 80
4 : 10 * 5 = 50
5 : 10 * 0 = 0
6 : 5 * 2 = 10
7 : 5 * 10 = 50
8 : 3 * 0 = 0
9 : 0 * 6 = 0

Verifying. maxlen = 100
0 : 64 * 0 = 0
1 : 77 * 96 = 7392
2 : 24 * 13 = 312
3 : 53 * 39 = 2067
4 : 74 * 39 = 2886
5 : 92 * 97 = 8924
6 : 31 * 48 = 1488
7 : 39 * 17 = 663
8 : 42 * 25 = 1050
9 : 94 * 25 = 2350
10 : 82 * 83 = 6806
11 : 2 * 97 = 194
12 : 90 * 30 = 2700
13 : 93 * 24 = 2232
14 : 91 * 37 = 3367
15 : 24 * 86 = 2064
16 : 70 * 15 = 1050
17 : 2 * 4 = 8
18 : 72 * 58 = 4176
19 : 25 * 84 = 2100


Timings
0 : 8192 loops. 7 * 8 = 56
sorted_prod_brute  : 0.659312, 0.665853, 0.710947
upprod1            : 1.695471, 1.705061, 1.739299
sorted_prod_merge  : 1.990161, 1.991129, 2.001242
sorted_prod_diag   : 3.013945, 3.018927, 3.053115
upprod2            : 3.582396, 3.586332, 3.622949

1 : 2048 loops. 18 * 16 = 288
sorted_prod_brute  : 0.826128, 0.840111, 0.863559
upprod1            : 2.240931, 2.241636, 2.244615
sorted_prod_merge  : 2.301838, 2.304075, 2.306918
sorted_prod_diag   : 3.030672, 3.053302, 3.135322
upprod2            : 4.860378, 4.949804, 4.953891

2 : 512 loops. 39 * 32 = 1248
sorted_prod_brute  : 0.907932, 0.918692, 0.942830
sorted_prod_merge  : 2.559567, 2.561709, 2.604387
upprod1            : 2.700482, 2.701147, 2.757695
sorted_prod_diag   : 2.961776, 2.965271, 2.995747
upprod2            : 5.563303, 5.654425, 5.656695

3 : 128 loops. 68 * 70 = 4760
sorted_prod_brute  : 0.823448, 0.827748, 0.835049
sorted_prod_merge  : 2.591373, 2.592134, 2.685534
upprod1            : 2.760466, 2.763615, 2.795082
sorted_prod_diag   : 2.789673, 2.828662, 2.848498
upprod2            : 5.483504, 5.488450, 5.517847

4 : 32 loops. 122 * 156 = 19032
sorted_prod_brute  : 0.873736, 0.880958, 0.892846
sorted_prod_merge  : 2.701089, 2.742456, 2.818822
upprod1            : 2.875358, 2.881793, 2.922569
sorted_prod_diag   : 2.953450, 2.988184, 3.012430
upprod2            : 5.780552, 5.812967, 5.826775

5 : 8 loops. 173 * 309 = 53457
sorted_prod_brute  : 0.711012, 0.711816, 0.721627
sorted_prod_merge  : 1.997386, 1.999774, 2.033489
upprod1            : 2.137337, 2.172369, 3.335119
sorted_prod_diag   : 2.324447, 2.329552, 2.331095
upprod2            : 4.278704, 4.289019, 4.324436

Answer 2: (score: -2)

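There seems to be no other way to sort these outputs without creating a list, because the outputs can't be sorted without being stored. Here's how you can do it.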

myList = []

for i in range(len(l1)):
    for j in range(len(l2)):
        output = l1[i] * l2[j]
        myList.append(output)
myList.sort()
print(myList)

Hope it helps.