Question

I have the following data:

client_id,transaction_id,start,end,amount
1,1,2018-12-09,2018-12-11,1000
1,2,2018-12-19,2018-12-21,2000
1,3,2018-12-19,2018-12-31,3000
2,4,2018-11-09,2018-12-20,4000
2,5,2018-12-19,2018-12-21,5000
2,6,2018-12-22,2018-12-31,6000

Using PySpark, I am trying to add a column that shows, for each row, how many transactions of the same client had already completed by that row's start date. I can do this in Pandas with fairly simple code, as shown below:

import pandas as pd
df = pd.read_csv('transactions.csv')
df['closed_transactions'] = df.apply(
    lambda row: len(df[(df['end'] < row['start']) & (df['client_id'] == row['client_id'])]),
    axis=1)

This yields the dataframe:

   client_id  transaction_id       start         end  amount  closed_transactions
0          1               1  2018-12-09  2018-12-11    1000                    0
1          1               2  2018-12-19  2018-12-21    2000                    1
2          1               3  2018-12-19  2018-12-31    3000                    1
3          2               4  2018-11-09  2018-12-20    4000                    0
4          2               5  2018-12-19  2018-12-21    5000                    0
5          2               6  2018-12-22  2018-12-31    6000                    2

However, I am struggling to get the same thing to work in PySpark. I can add a simple counter per group with a Window function, and a cumulative sum also works, but I cannot get the number of closed transactions given the data of the current row.

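My current workaround is to convert the Spark dataframe to Pandas and broadcast it so that I can use it inside a UDF, but I am hoping there is a more elegant solution to this problem.

Any help is much appreciated!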

Answer 1

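As I mentioned in the comments, join the dataframe with itself on client_id and add a flag column that is 1 where the other transaction's end date is earlier than the current row's start date, and 0 otherwise. We can then group by the current row's columns and sum this flag to get the number of closed transactions.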

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Window
import pyspark.sql.functions as psf

config = SparkConf().setMaster('local')
spark = SparkContext.getOrCreate(conf=config)
sqlContext = SQLContext(spark)

spark_df = sqlContext.read.csv('transactions.csv', header=True)

# Renaming columns for self join
df2 = spark_df
for c in spark_df.columns:
    df2 = df2 if c == 'client_id' else df2.withColumnRenamed(c, 'x_{cl}'.format(cl=c))

# Joining with self on client ID
new_df = spark_df.join(df2, 'client_id')

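# Creating the flag column (1 when the other transaction ended before this one started)
# and summing it per transaction. The dates are read as strings, which compare
# correctly here because they are in ISO format (YYYY-MM-DD).
# transaction_id is part of the grouping so that transactions sharing a start date stay separate.
new_df = (new_df
    .withColumn('valid_transaction', psf.when(psf.col('x_end') < psf.col('start'), 1).otherwise(0))
    .groupBy('client_id', 'transaction_id', 'start')
    .agg(psf.sum('valid_transaction').alias('closed_transactions')))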
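
To check the join-and-sum logic independently of Spark, the same steps can be traced in plain Python on the sample data from the question. This is only a minimal sketch: the helper name closed_transactions and the inlined CSV_TEXT are made up for illustration, and the nested loop stands in for the self join rather than reproducing how Spark executes it.

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question (inlined here instead of reading transactions.csv).
CSV_TEXT = """client_id,transaction_id,start,end,amount
1,1,2018-12-09,2018-12-11,1000
1,2,2018-12-19,2018-12-21,2000
1,3,2018-12-19,2018-12-31,3000
2,4,2018-11-09,2018-12-20,4000
2,5,2018-12-19,2018-12-21,5000
2,6,2018-12-22,2018-12-31,6000
"""


def closed_transactions(rows):
    """Hypothetical helper: per transaction, count same-client transactions that ended before it started."""
    # "Self join" on client_id: every row is paired with all rows of the same client.
    by_client = defaultdict(list)
    for row in rows:
        by_client[row["client_id"]].append(row)

    counts = {}
    for row in rows:
        # Flag is 1 when the other row's end is strictly before this row's start.
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        flags = [1 if other["end"] < row["start"] else 0
                 for other in by_client[row["client_id"]]]
        # "Group by" the current transaction and sum the flags.
        counts[row["transaction_id"]] = sum(flags)
    return counts


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
result = closed_transactions(rows)
print(result)
```

The counts per transaction_id should match the closed_transactions column of the Pandas output in the question. Note that a row is also paired with itself in the join, which is harmless as long as no transaction ends before it starts.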
