Question

I'm working on a personal PySpark project for learning purposes, and I've run into a peculiar problem.

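I have a dataframe (df) with N columns, in which I want to subtract from each column the column that follows it (e.g. col1 - col2, col2 - col3, ..., col(N-1) - colN) and save the resulting difference columns in another dataframe.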

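I generate this df by parsing a JSON, saving it to a Pandas dataframe (schema: a dates column, plus one column per item), transposing the columns to rows (so there is a single Items column plus one column per date), and then converting it to a Spark df. I do this because row-by-row operations seem fairly difficult to implement in Spark.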

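I then move the first column (the Items column) of the df into a new dataframe (ndf), so that I am left with only the following schema (the headers are dates and the data are integers only):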

Date1	Date2	Date3	...
104	98	98	...
223	135	80	...
143	122	114	...
91	79	73	...

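I want to subtract the integers of column Date2 from the integers of column Date1 (i.e. df.Date1 - df.Date2) and append the resulting column of values (with the header of the larger column, Date1) to the already existing ndf dataframe (the one into which I moved the Items column earlier). Then I want to move on to subtract column Date3 from column Date2 (df.Date2 - df.Date3), and so on until Date(N-1) - DateN, and then stop.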

The new dataframe (ndf), created earlier from the Items column, would then look like this:

Items	Date1	Date2	...
Item1	6	0	...
Item2	88	55	...
Item3	21	8	...
Item4	12	6	...

In effect, I want to see by how much each item changed from one date to the next.
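To make the target concrete, here is a small plain-Python sketch (not Spark) that reproduces the expected numbers from the two sample tables above. The dictionary layout and the names `columns` and `differences` are only assumptions made for this illustration; the values are copied from the first table.

```python
# Sample data from the first table, stored column-wise (one list per date column).
columns = {
    "Date1": [104, 223, 143, 91],
    "Date2": [98, 135, 122, 79],
    "Date3": [98, 80, 114, 73],
}

names = list(columns)  # ["Date1", "Date2", "Date3"]
differences = {}

# Pair every column with the one that follows it: (Date1, Date2), (Date2, Date3).
for left, right in zip(names, names[1:]):
    # Element-wise subtraction, labelled with the left-hand column's header.
    differences[left] = [a - b for a, b in zip(columns[left], columns[right])]

print(differences)
```

With three date columns this yields two difference columns, which line up with the Date1 and Date2 columns shown for ndf.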

I was thinking of doing it in a for loop. Something like:

from pyspark.sql import functions as F

# get list of column headers
dates = df.columns
# for index and header in list
for idx, date in enumerate(dates):
    if idx < len(dates)-1:
        # calculate df columns subtraction and add differences column to ndf
        df = df.withColumn(f'diff-{date}', F.when((df[date] - df[dates[idx+1]]) < 0, 0)
                        .otherwise(df[date] - df[dates[idx+1]]))
        ndf = ndf.join(df.select(f'diff-{date}'), how='full')

But this is very slow, and I feel the for loop doesn't really take advantage of Spark's strengths; it is probably much slower than using map/lambda.

Answer 1

I found 2 solutions:

For the transposed dataframe, as I described in the question above, a user on reddit r/dataengineering helped me with this solution:

# list to save column subtractions
col_defs = []
# grab the date columns
date_cols = df.columns[1:]
# for index and column
for i, date in enumerate(date_cols):
    if i > 0:
        # save the difference between each 2 columns to the list
        col_defs.append((df[date_cols[i - 1]] - df[date]).alias(date))
# result df containing only the items column (apparently named 'county' in this df) and the differences for each date
result = df.select('county', *col_defs)
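One detail worth checking in the loop above is which header each difference ends up with. The plain-Python sketch below only inspects the pairing of column names, with hypothetical headers standing in for the real date columns: the loop aliases each difference with the right-hand (current) column, while the question asked for the left-hand one.

```python
# Hypothetical date headers, used only for illustration.
date_cols = ["Date1", "Date2", "Date3", "Date4"]

# Same pairing as the enumerate loop with `if i > 0`: (previous column, current column).
pairs = list(zip(date_cols, date_cols[1:]))

# The loop uses .alias(date), i.e. the current (right-hand) column name.
labels_as_written = [right for _, right in pairs]

# The question asks for the header of the left-hand column instead.
labels_as_asked = [left for left, _ in pairs]

print(pairs)
print(labels_as_written)
print(labels_as_asked)
```

If the left-hand header is wanted, aliasing with `date_cols[i - 1]` instead of `date` in the loop should be enough.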

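If I don't transpose the dataframe, I can apply a window function, as @mck recommended in a comment on the question. I prefer this approach because I avoid the transposition and the number of columns stays constant. This resource on PySpark Window Functions was very helpful for understanding how they work: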

Dates	Item1	Item2	...
Date1	104	223	...
Date2	98	135	...
Date3	98	80	...

from pyspark.sql import Window
from pyspark.sql import functions as F

# list to save column subtractions
colDiffs = []
# get only the item columns
itemCols = df.columns[1:]
# Window spec that treats the entire df as one partition and sorts it by Dates descending (no date appears more than once).
windowSpec = Window.partitionBy().orderBy(F.col('Dates').desc())
# for each item column
for item in itemCols:
    # add a new column, {item}diff, to the df containing the same numbers but shifted up by one
    # e.g. if a column X contains [1, 2, 3], applying the lead window function with offset 1 shifts everything up by 1 and the new Xdiff column contains [2, 3, null]
    df = df.withColumn(f'{item}diff', F.lead(item, 1).over(windowSpec))
    # append the difference between the current and the lead column to the list
    colDiffs.append((df[item] - df[f'{item}diff']).alias(item))
# get the final df containing the subtraction results
result = df.select('Dates', *colDiffs)
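For anyone unsure what the lead step does, here is a plain-Python stand-in (not Spark code) applied to the sample rows above, taken in the order they are listed. The helper names `lead` and `diff_with_lead` are made up for this sketch; it only mimics the shifting and the subtraction, and it shows that the last row has no following value, so its difference stays empty.

```python
def lead(values, offset=1):
    """Shift values up by `offset` and pad the end with None, like a lead window function."""
    return values[offset:] + [None] * min(offset, len(values))


def diff_with_lead(values):
    """Subtract the following row's value from each row; the last row has nothing to subtract."""
    shifted = lead(values)
    return [None if nxt is None else cur - nxt for cur, nxt in zip(values, shifted)]


# Item columns from the sample table, rows in the listed order (Date1, Date2, Date3).
item1 = [104, 98, 98]
item2 = [223, 135, 80]

item1_diff = diff_with_lead(item1)
item2_diff = diff_with_lead(item2)

print(item1_diff)
print(item2_diff)
```

In Spark the empty trailing value shows up as null, so the row for the last date in the window order will contain null differences unless it is filtered out or filled.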

PySpark: subtract dataframe columns from the next column and save the results to another dataframe
