Question

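I want to perform a row-wise operation on a DataFrame, where the operation takes some fixed variables as arguments. The only way I know to do this is with a nested function. I'm trying to compile part of my code with Cython and then call the Cython function from mapPartitions, but it raises the error PicklingError: Can't pickle <cyfunction outer_function.<locals>._nested_function at 0xfffffff>.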

In pure Python, I would do this:

def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function

output_df = input_df.repartition(some_col).rdd \
    .mapPartitions(outer_function(a, b))

Now I have outer_function defined in a separate file, like this:

# outer_func.pyx

def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function

And this:

# runner.py

from outer_func import outer_function

output_df = input_df.repartition(some_col).rdd \
    .mapPartitions(outer_function(a, b))

This raises the pickling error shown above.

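I've looked at https://docs.databricks.com/user-guide/faq/cython.html and tried to adapt outer_function accordingly. The same error still occurs. The problem is that the nested function doesn't appear in the module's global namespace, so it can't be found and serialized.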

I've also tried this:

def outer_function(fixed_var_1, fixed_var_2):
    global _nested_function
    def _nested_function(partition):
        for row in partition:
            yield dosomething(row, fixed_var_1, fixed_var_2)
    return _nested_function

This raises a different error: AttributeError: 'module' object has no attribute '_nested_function'.

Is there a way to do this without a nested function in this case? Or is there another way to make the nested function "serializable"?

Thanks!

EDIT: I've also tried

# outer_func.pyx

class PartitionFuncs:

    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2

    def nested_func(self, partition):
        for row in partition:
            yield dosomething(row, self.fixed_var_1, self.fixed_var_2)

# main.py

from outer_func import PartitionFuncs

p_funcs = PartitionFuncs(a, b)
output_df = input_df.repartition(some_col).rdd \
    .mapPartitions(p_funcs.nested_func)

I still get PicklingError: Can't pickle <cyfunction PartitionFuncs.nested_func at 0xfffffff>. Oh well, that idea didn't work.

Answer 1

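This is sort of a half-answer: when I tried your class PartitionFuncs, the method p_funcs.nested_func pickled and unpickled fine (though I didn't try combining it with PySpark), so whether the solution below is necessary may depend on your Python version, platform, etc. Pickle should support bound methods from Python 3.4, but it looks like PySpark forces the pickle protocol to 3, which will stop that from working. There may be ways to change that, but I don't know of any.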

Nested functions are known not to be pickleable, so that approach definitely won't work. The class approach is the right one.
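As a quick illustration of the point about nested functions, here is a minimal pure-Python sketch using the standard pickle module. It doesn't involve Cython or PySpark, and the tuple yielded in place of dosomething is just a placeholder, so treat it as a demonstration of the general limitation rather than a reproduction of your exact error.

```python
import pickle


def outer_function(fixed_var_1, fixed_var_2):
    def _nested_function(partition):
        for row in partition:
            # Placeholder for dosomething(row, fixed_var_1, fixed_var_2).
            yield (row, fixed_var_1, fixed_var_2)
    return _nested_function


def try_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# The closure only exists inside outer_function's local scope, so pickle
# cannot look it up by its qualified name.
nested_ok = try_pickle(outer_function(1, 2))
print("nested function pickles:", nested_ok)
```

The standard pickle module stores functions by reference (module plus qualified name), which is why anything that lives only in a local scope can't be found again.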

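My suggestion in the comments was to try pickling only the class instance, rather than the bound function. To do this, an instance of the class needs to be callable, so you rename your function to __call__: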

class PartitionFuncs:
    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2

    def __call__(self, partition):
        for row in partition:
            yield dosomething(row, self.fixed_var_1, self.fixed_var_2)
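A rough way to check this locally, before involving PySpark, is a pickle round trip at protocol 3. This is only a sketch in plain Python: dosomething below is a hypothetical stand-in, and I haven't verified the behaviour with a Cython-compiled module or on a cluster.

```python
import pickle


def dosomething(row, fixed_var_1, fixed_var_2):
    # Hypothetical stand-in for the real per-row operation.
    return (row, fixed_var_1, fixed_var_2)


class PartitionFuncs:
    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2

    def __call__(self, partition):
        for row in partition:
            yield dosomething(row, self.fixed_var_1, self.fixed_var_2)


# Pickle the instance itself (not a bound method) at protocol 3.
func = PartitionFuncs("a", "b")
payload = pickle.dumps(func, protocol=3)
restored = pickle.loads(payload)

# The restored instance is callable on any iterable of rows.
rows = list(restored(iter([1, 2])))
print(rows)
```

In runner.py you would then pass the instance directly, i.e. .mapPartitions(PartitionFuncs(a, b)), instead of p_funcs.nested_func.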

This approach does depend on the two fixed_var variables being pickleable by default. If they aren't, you can write custom saving and loading methods, as described in the pickle documentation.
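For the custom saving and loading case, a minimal sketch using __getstate__ and __setstate__ might look like the following. The _lock attribute is a made-up example of something pickle can't handle, and dosomething is again a hypothetical stand-in; your real unpicklable state would need its own way of being rebuilt.

```python
import pickle
import threading


def dosomething(row, fixed_var_1, fixed_var_2):
    # Hypothetical stand-in for the real per-row operation.
    return row * fixed_var_1 + fixed_var_2


class PartitionFuncs:
    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2
        # Hypothetical unpicklable attribute, used only for illustration.
        self._lock = threading.Lock()

    def __getstate__(self):
        # Save everything except the attribute pickle cannot handle.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and rebuild the dropped one.
        self.__dict__.update(state)
        self._lock = threading.Lock()

    def __call__(self, partition):
        for row in partition:
            with self._lock:
                yield dosomething(row, self.fixed_var_1, self.fixed_var_2)


func = PartitionFuncs(2, 1)
restored = pickle.loads(pickle.dumps(func, protocol=3))
result = list(restored([1, 2, 3]))
print(result)
```

Without the two hooks, pickle.dumps would fail on the lock; with them, only the plain attributes travel and the lock is recreated on load.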

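As you pointed out in the comments, this does mean you need a separate class for each function you define. The options here involve inheritance, or having a single PickleableData class that each Func class holds a reference to.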
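One possible shape for that, sketched in plain Python: a PickleableData holder for the fixed variables, plus a small base class so each function only has to define its per-row logic. The names PartitionFunc, AddFunc, ScaleFunc and apply are all hypothetical, and this is one arrangement among several rather than a recommended design.

```python
import pickle


class PickleableData:
    """Holds the fixed variables shared by several partition functions."""

    def __init__(self, fixed_var_1, fixed_var_2):
        self.fixed_var_1 = fixed_var_1
        self.fixed_var_2 = fixed_var_2


class PartitionFunc:
    """Hypothetical base class: keeps a reference to the shared data."""

    def __init__(self, data):
        self.data = data

    def __call__(self, partition):
        for row in partition:
            yield self.apply(row)

    def apply(self, row):
        raise NotImplementedError


class AddFunc(PartitionFunc):
    def apply(self, row):
        return row + self.data.fixed_var_1


class ScaleFunc(PartitionFunc):
    def apply(self, row):
        return row * self.data.fixed_var_2


data = PickleableData(10, 3)
# Each callable instance pickles together with the data it references.
add_func = pickle.loads(pickle.dumps(AddFunc(data), protocol=3))
scale_func = pickle.loads(pickle.dumps(ScaleFunc(data), protocol=3))

added = list(add_func([1, 2]))
scaled = list(scale_func([1, 2]))
print(added, scaled)
```

This keeps the pickling concerns in one place (PickleableData), so any custom saving and loading logic only has to be written once.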
