Question

Given a Spark DataFrame which looks something like this:

==================================
| Name | Col1 | Col2 | .. | ColN |
----------------------------------
|    A |    1 |   11 | .. |   21 |
|    A |   31 |   41 | .. |   51 |
|    B |    2 |   12 | .. |   22 |
|    B |   32 |   42 | .. |   52 |
==================================

I would like to run logic that carries out an aggregation/computation on the partition of the table that corresponds to a particular Name value. Said logic requires that the full contents of the partition, and only that partition, be materialized in memory on the node executing the logic; it would look something like the processSegment function below:

def processDataMatrix(dataMatrix):
    # do some number crunching on a 2-D matrix
    ...

def processSegment(dataIter):
    # "running" value of the Name column in the iterator
    dataName = None
    # as the iterator is processed, put the data in a matrix
    dataMatrix = []

    for dataTuple in dataIter:
        # separate the name column from the other columns
        (name, *values) = dataTuple
        # SANITY CHECK: ensure that all rows have same name
        if (dataName is None):
            dataName = name
        else:
            assert (dataName == name), 'row name ' + str(name) + ' does not match expected ' + str(dataName)
        # put the row in the matrix
        dataMatrix.append(values)

    # if any rows were processed, number-crunch the matrix
    if (dataName is not None):
        return processDataMatrix(dataMatrix)
    else:
        return []

I have tried to make this work by repartitioning based on the Name column, then running processSegment on each partition via mapPartitions on the underlying RDD:

result = \
    stacksDF \
        .repartition('Name') \
        .rdd \
        .mapPartitions(processSegment) \
        .collect()

However, the process routinely fails the SANITY CHECK assertion in processSegment.

Why is the partitioning ostensibly carried out on the DataFrame not being preserved when I run mapPartitions on the underlying RDD? If the approach above is not valid, is there some approach (using either the DataFrame API or the RDD API) that would let me carry out aggregation logic on the in-memory rendition of a DataFrame partition?

(As I am using PySpark, and the particular number-crunching logic I wish to carry out is written in Python, user-defined aggregation functions (UDAFs) would not appear to be an option.)

Answer 1

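I believe you have misunderstood how partitioning works. In general, a partitioner is a surjective function, not a bijective one. While all records for a specific value will be moved to a single partition, a partition may contain records with multiple different values.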
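To make the point concrete, here is a small plain-Python sketch (no Spark required) of a hash-style partitioner. The helper names assign_partition and names_per_partition are hypothetical, and zlib.crc32 is only a reproducible stand-in for whatever hash Spark actually applies. With five distinct names and three partitions, at least one partition must hold more than one name, whichever hash is used.

```python
import zlib
from collections import defaultdict

def assign_partition(name, num_partitions):
    # Stand-in for a hash partitioner: deterministic hash modulo partition count.
    # crc32 is an assumption chosen for reproducibility, not Spark's own hash.
    return zlib.crc32(name.encode("utf-8")) % num_partitions

def names_per_partition(names, num_partitions):
    # Map each partition index to the set of distinct names routed to it.
    partitions = defaultdict(set)
    for name in names:
        partitions[assign_partition(name, num_partitions)].add(name)
    return dict(partitions)

names = ["A", "B", "C", "D", "E"]
layout = names_per_partition(names, 3)

# Partitions that ended up with more than one distinct name.
shared = {p: sorted(ns) for p, ns in layout.items() if len(ns) > 1}
print(shared)
```

Every name still lands in exactly one partition, which is the guarantee repartitioning by a column gives; what it does not give is one name per partition.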

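The DataFrame API doesn't give you any control over the partitioner, but it is possible to define a custom partitionFunc when using the RDD API. This means you can use one that is bijective, for example: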

mapping = (df
    .select("Name")
    .distinct()
    .rdd.flatMap(lambda x: x)
    .zipWithIndex()
    .collectAsMap())

def partitioner(x):
    return mapping[x]

and use it as follows:

df.rdd.map(lambda row: (row.Name, row)).partitionBy(len(mapping), partitioner)
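Note that after partitionBy each element is a (Name, row) pair rather than a bare row, so a function passed to mapPartitions has to unpack the pair. Below is a minimal sketch of such a function, written in plain Python so it can be tried without a cluster. The names process_pairs and process_data_matrix are hypothetical and not part of the original answer; the column-sum body is only a placeholder for the real number crunching, and the name is assumed to be the first column of each row, as in the question's table. The sketch groups rows by name instead of asserting a single name, so it would also tolerate a partition that holds several names.

```python
from collections import OrderedDict

def process_data_matrix(data_matrix):
    # Hypothetical stand-in for the real number crunching: per-column sums.
    return [sum(col) for col in zip(*data_matrix)]

def process_pairs(pair_iter):
    # pair_iter yields (name, row) pairs; row[0] is assumed to be the Name column.
    groups = OrderedDict()
    for name, row in pair_iter:
        # Keep only the value columns and collect them per name.
        groups.setdefault(name, []).append(list(row[1:]))
    # Emit one (name, result) pair per distinct name seen in this partition.
    for name, matrix in groups.items():
        yield name, process_data_matrix(matrix)

sample = [("A", ("A", 1, 11, 21)), ("A", ("A", 31, 41, 51))]
print(list(process_pairs(iter(sample))))
```

Under these assumptions, the function could be chained onto the expression above as .mapPartitions(process_pairs).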

Keep in mind, though, that partitions are not free, and if the number of unique values is large this can become a serious performance issue.

Preserve Spark DataFrame column partitioning on RDD transformation
