Question

I have a PySpark RDD in which each element has the form (key, val) and takes one of the following two shapes:

elm1 = ((1, 2), ((3, 4), (5, 6)))  # key = (1,2), rest is val
elm2 = ((1, 2), ((3, 4), None))

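Now I need to do the following:

1. Detect the elements whose val has None as its second part (as in elm2) and extract them.
2. Flatten them as follows, replacing None with a tuple of empty strings: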
```
elm = (1, 2, 3, 4, ('', ''))
```

To perform these two steps in PySpark, I would do this:

elm = elm.filter(lambda x: detectNone(x))  # checks if x[-1][1] is None
elm = elm.map(formatElm) # where formatElm is a function that replaces None with tuple of empty strings and flattens the tuple.
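
The question does not show detectNone or formatElm, so the following is only a minimal sketch of what they might look like, assuming the simple case from the example (a plain `is None` test and a tuple of empty strings as the replacement). The bodies and the `PLACEHOLDER` constant are hypothetical; the functions are exercised on a local list so the sketch runs without a Spark cluster.

```python
# Hypothetical implementations of the two helpers named in the question.
PLACEHOLDER = ("", "")  # assumed replacement for None, taken from the example


def detectNone(x):
    """Return True when the second part of val is None."""
    return x[-1][1] is None


def formatElm(x):
    """Flatten (key, (first, None)) into key + first + (PLACEHOLDER,)."""
    key, (first, _) = x
    return key + first + (PLACEHOLDER,)


elm1 = ((1, 2), ((3, 4), (5, 6)))
elm2 = ((1, 2), ((3, 4), None))

# Local stand-in for rdd.filter(detectNone).map(formatElm)
kept = [formatElm(e) for e in (elm1, elm2) if detectNone(e)]
print(kept)  # [(1, 2, 3, 4, ('', ''))]
```

On an RDD the same two functions would be passed to `filter` and `map` exactly as in the lines above.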

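In practice, the test x[-1][1] == None is somewhat more complicated, and a more complex data structure is substituted in place of the tuple of empty strings.

Question: Is there a way to speed up these operations? I think combining the two operations into one might help, but I don't know how to do that.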

Answer 1

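> I think combining the two operations into one might help,

It won't: Spark already pipelines consecutive narrow transformations such as filter and map into a single pass over each partition. But if you really insist on doing it, use flatMap: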

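from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # already available as sc in the pyspark shell
rdd = sc.parallelize([((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))])


def detect_and_format(row):
    # Unpack the key x and the two parts (y, z) of val.
    x, (y, z) = row
    # A one-element list keeps the flattened row; an empty list drops it.
    return [x + y + (("", ""), )] if z is None else []


rdd.flatMap(detect_and_format).collect()
# [(1, 2, 3, 4, ('', ''))]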
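
As a rough local check that the single flatMap function does the same work as filter followed by map, the sketch below compares the two on a plain Python list. `chain.from_iterable` is used here only to imitate how flatMap concatenates the per-row lists; `is_none_row` and `format_row` are hypothetical names for the two separate steps.

```python
from itertools import chain

rows = [((1, 2), ((3, 4), (5, 6))), ((1, 2), ((3, 4), None))]


def is_none_row(row):
    # Step 1 (filter): keep rows whose second val part is None.
    return row[1][1] is None


def format_row(row):
    # Step 2 (map): flatten and substitute a tuple of empty strings.
    x, (y, _) = row
    return x + y + (("", ""),)


def detect_and_format(row):
    # Both steps at once: [] drops the row, a one-element list keeps it.
    x, (y, z) = row
    return [x + y + (("", ""),)] if z is None else []


two_step = list(map(format_row, filter(is_none_row, rows)))
one_step = list(chain.from_iterable(detect_and_format(r) for r in rows))

print(one_step == two_step)  # True
```

Both versions visit each row once, which is consistent with the answer's point that merging the steps changes the shape of the code rather than the amount of work.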
