Question

Can someone explain the groupby operation and the lambda function used in this SO post?

key=lambda k, line=count(): next(line) // chunk

import os
import tempfile
from itertools import groupby, count

temp_dir = tempfile.mkdtemp()

def tempfile_split(filename, temp_dir, chunk=4000000):
    with open(filename, 'r') as datafile:
        # itertools.groupby() takes an iterable and a key function,
        # and returns an iterator that yields pairs.
        # Each pair contains the result of key_function(item) and
        # another iterator over the consecutive items that shared that key.
        groups = groupby(datafile, key=lambda k, line=count(): next(line) // chunk)
        for k, group in groups:
            # Materialize the group once; calling list(group) only for the
            # print would exhaust it and leave nothing to write.
            lines = list(group)
            print(k, lines)

            output_name = os.path.normpath(os.path.join(temp_dir + os.sep, "tempfile_%s.tmp" % k))
            with open(output_name, 'a') as outfile:
                for line in lines:
                    outfile.write(line)

Edit: It took me some time to wrap my head around the lambda function used with groupby. I don't think I understood either of them very well.

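Martijn explained it really well, but I have a follow-up question. Why is line=count() passed as an argument to the lambda function each time? I tried assigning the variable line to count() just once, outside the function.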

    line = count()
    groups = groupby(datafile, key=lambda k, line: next(line) // chunk)

which results in TypeError: <lambda>() missing 1 required positional argument: 'line'

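Also, calling next on count() directly inside the lambda expression causes all the lines in the input file to be bunched together, i.e. the groupby function generates a single key.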

groups = groupby(datafile, key=lambda k: next(count()) // chunk)

I'm learning Python on my own, so any help or pointers to reference materials / PyCon talks are much appreciated. Anything really!

Answer 1

itertools.count() is an infinite iterator of increasing integers.

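The lambda stores a count() instance as the default value of its keyword argument line, so every time the lambda is called, the local variable line refers to that same object. next() advances the iterator and retrieves the next value: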

>>> from itertools import count
>>> line = count()
>>> next(line)
0
>>> next(line)
1
>>> next(line)
2
>>> next(line)
3

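So next(line) retrieves the next count in the sequence and divides that value by chunk (keeping only the integer part of the division). The k argument is ignored.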

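Because integer division is used, the result of the lambda is an increasing integer, each value repeated chunk times; if chunk is 3, you get 0 three times, then 1 three times, then 2 three times, and so on: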

>>> chunk = 3
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> chunk = 4
>>> l = lambda k, line=count(): next(line) // chunk
>>> [l('ignored') for _ in range(10)]
[0, 0, 0, 0, 1, 1, 1, 1, 2, 2]

It is this resulting value that groupby() uses to group the datafile iterable, producing groups of chunk lines.

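When looping over the groupby() results with for k, group in groups:, k is the number the lambda produced, i.e. the key the results are grouped by; the code in the question uses it only to build the output file name. group is an iterable of lines from datafile and contains chunk lines, except possibly the last group, which holds whatever lines remain.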
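
The grouping described above can be tried on a small in-memory list instead of a file. This is only a sketch: the helper name chunked and the sample data are made up for illustration and are not part of the original code.

```python
from itertools import count, groupby

def chunked(lines, chunk):
    """Yield (key, list_of_lines) pairs holding at most `chunk` lines each."""
    # The lambda is defined on each call of chunked(), so every call gets
    # its own counter, bound once as the default value of `counter`.
    key = lambda _item, counter=count(): next(counter) // chunk
    for k, group in groupby(lines, key=key):
        yield k, list(group)

lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n", "g\n"]
result = list(chunked(lines, chunk=3))
for k, group in result:
    print(k, group)
```

With seven lines and chunk=3, the keys are 0, 1 and 2, and the last group holds only the one leftover line.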

Answer 2

In response to the updated OP...

The itertools.groupby iterator groups consecutive items together and gives you more control when you define a key function. For more information, see how itertools.groupby() works.

A lambda function is a functional, shorthand way of writing a regular function. For example:

>>> keyfunc = lambda k, line=count(): next(line) // chunk

is equivalent to this regular function:

>>> def keyfunc(k, line=count()):
...     return next(line) // chunk

Keywords: iterator, functional programming, anonymous functions

Details

Why is line=count() passed as an argument to the lambda function each time?

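For the same reason as in a normal function. On its own, the line parameter is a positional argument that the caller must supply. Once a value is assigned to it in the signature, it becomes a keyword argument with a default, so the caller can leave it out. For more information, see positional vs. keyword arguments.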
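
A small sketch of the difference, outside of groupby(). The names two_args and with_default are illustrative only; the point is that groupby() calls the key function with a single argument, and that a default value is evaluated once, when the lambda is defined.

```python
from itertools import count

chunk = 3

# `line` has no default here, so it is a second required parameter.
two_args = lambda k, line: next(line) // chunk

try:
    two_args("some line")   # called with one argument, the way groupby() calls a key
    error = None
except TypeError as exc:
    error = str(exc)
print(error)

# With a default, the count() object is created once, at definition time,
# and that same object is reused on every call.
with_default = lambda k, line=count(): next(line) // chunk
keys = [with_default("some line") for _ in range(7)]
print(keys)
```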

You can still create line = count() outside the function, by passing the result in as the default of a keyword argument:

>>> chunk = 3
>>> line=count()
>>> keyfunc = lambda k, line=line: next(line) // chunk       # make `line` a keyword arg
>>> [keyfunc("") for _ in range(10)]
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
>>> [keyfunc("") for _ in range(10)]
[3, 3, 4, 4, 4, 5, 5, 5, 6, 6]                               # note `count()` continues
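
As a variation, and assuming the counter lives in an enclosing scope, the lambda can also take only k and pick up line as a free variable. This sketch is not from the original post; the sample data is made up.

```python
from itertools import count, groupby

chunk = 3
line = count()   # created once, outside the lambda

# `line` is not a parameter: it is looked up in the enclosing scope
# each time the lambda runs.
keyfunc = lambda k: next(line) // chunk

data = ["a", "b", "c", "d", "e", "f", "g", "h"]
groups = [(k, list(g)) for k, g in groupby(data, key=keyfunc)]
print(groups)
```

The trade-off is that a default argument is bound when the lambda is defined, while a free variable is looked up when the lambda is called, so rebinding line later would change what this version counts from.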

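...calling next on count() directly in the lambda expression causes all the lines in the input file to be bunched together, i.e. the groupby function generates a single key...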

Try the following experiment with count():

>>> numbers = count()
>>> next(numbers)
0
>>> next(numbers)
1
>>> next(numbers)
2

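As expected, next() yields the next item from the count() iterator. (A similar function is called when you iterate over an iterator with a for loop.) What is distinctive here is that the iterator does not reset: next() simply gives the next item in line, as in the previous example. Writing next(count()) inside the lambda, however, creates a brand-new counter on every call, so the result is always 0 and every line gets the same key.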
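
The single-key behaviour from the question can be reproduced on a short list. This is a minimal sketch with made-up data; fresh_each_call is just an illustrative name.

```python
from itertools import count, groupby

chunk = 3
data = ["a", "b", "c", "d", "e"]

# A new counter is built on every call, so next() always returns
# that counter's first value, 0.
fresh_each_call = lambda k: next(count()) // chunk
keys = [fresh_each_call(item) for item in data]
print(keys)      # [0, 0, 0, 0, 0]

groups = [(k, list(g)) for k, g in groupby(data, key=fresh_each_call)]
print(groups)    # [(0, ['a', 'b', 'c', 'd', 'e'])]
```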

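@Martijn Pieters pointed out that next(line) // chunk computes a floored integer, which groupby uses to identify each line (bunching consecutive lines with the same id together); this is also as expected. See the references for more on how groupby works.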

References

Docs for itertools.count
Docs for itertools.groupby()
Beazley, D. and Jones, B. "7.7 Capturing Variables in Anonymous Functions," Python Cookbook, 3rd ed. O'Reilly. 2013.
