I'm trying to compute the part of each period's activity that was not covered by extra credit.
Input:
+------+--------+------+
|period|activity|credit|
+------+--------+------+
|     1|       5|     0|
|     2|       0|     3|
|     3|       4|     0|
|     4|       0|     3|
|     5|       1|     0|
|     6|       1|     0|
|     7|       5|     0|
|     8|       0|     1|
|     9|       0|     1|
|    10|       5|     0|
+------+--------+------+
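Desired output:
# period is included so that the DataFrame matches the table below
rdd = sc.parallelize([(1,5,0,5),(2,0,3,0),(3,4,0,1),(4,0,3,0),(5,1,0,0),(6,1,0,0),(7,5,0,4),(8,0,1,0),(9,0,1,0),(10,5,0,3)])
df = rdd.toDF(["period","activity","credit","realActivity"])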
+------+--------+------+------------+
|period|activity|credit|realActivity|
+------+--------+------+------------+
|     1|       5|     0|           5|
|     2|       0|     3|           0|
|     3|       4|     0|           1|
|     4|       0|     3|           0|
|     5|       1|     0|           0|
|     6|       1|     0|           0|
|     7|       5|     0|           4|
|     8|       0|     1|           0|
|     9|       0|     1|           0|
|    10|       5|     0|           3|
+------+--------+------+------------+
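I tried creating a credit-balance column that is incremented or decremented depending on the row type, but I can't get it to reset conditionally based on its own value (every time it drops below zero). It looks like a recursive problem, and I don't know how to express it in a PySpark-friendly way. Obviously, I can't do something like the following, where the column refers to its own previous value: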
from pyspark.sql import Window
from pyspark.sql.functions import lag
w = Window.orderBy("period")
df = df.withColumn("realActivity", lag("realActivity",1,0).over(w) - lag("credit", 1, 0).over(w) - lag("activity",1,0).over(w) )
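Update: As someone pointed out, this can't be done with a window calculation. So I'd like to do something like the snippet below to compute creditBalance, which in turn lets me compute realActivity.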
df['creditBalance'] = 0
for i in range(1, len(df)):
    if df.loc[i-1, 'creditBalance'] > 0:
        # A balance is left over from the previous row: carry it forward.
        df.loc[i, 'creditBalance'] = df.loc[i-1, 'creditBalance'] + df.loc[i, 'credit'] - df.loc[i, 'activity']
    elif df.loc[i, 'credit'] > 0:
        # No balance left: start a new one from this row's credit.
        df.loc[i, 'creditBalance'] = df.loc[i, 'credit'] - df.loc[i, 'activity']
Now my only question is: how do I apply this "local" function to each group in a Spark DataFrame?
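One way to picture the per-group step, independent of Spark, is as a plain function over one group's rows that is then mapped over every group. The sketch below is only an illustration under assumptions: the names `real_activity` and `per_group` and the user IDs are hypothetical, and it assumes that credit accumulates in a balance, that each period's activity draws that balance down, and that the balance never drops below zero. Those assumptions reproduce the desired output shown earlier. In Spark 3.0+, a function of this shape could be run per group with something like `groupBy(...).applyInPandas`, which is not shown here.

```python
from collections import defaultdict

def real_activity(rows):
    """rows: iterable of (period, activity, credit) for ONE group.
    Returns a list of (period, realActivity) ordered by period."""
    balance = 0
    out = []
    for period, activity, credit in sorted(rows):
        balance += credit                # credit tops up the balance
        used = min(balance, activity)    # activity is paid from credit first
        balance -= used                  # so the balance never goes negative
        out.append((period, activity - used))
    return out

def per_group(records):
    """records: iterable of (user_id, period, activity, credit)."""
    groups = defaultdict(list)
    for user_id, period, activity, credit in records:
        groups[user_id].append((period, activity, credit))
    return {user_id: real_activity(rows) for user_id, rows in groups.items()}

# Rows from the input table, plus a second made-up user for illustration.
sample = [(1, 5, 0), (2, 0, 3), (3, 4, 0), (4, 0, 3), (5, 1, 0),
          (6, 1, 0), (7, 5, 0), (8, 0, 1), (9, 0, 1), (10, 5, 0)]
records = [("u1", *row) for row in sample]
records += [("u2", 1, 2, 0), ("u2", 2, 0, 1), ("u2", 3, 3, 0)]

result = per_group(records)
print([value for _, value in result["u1"]])
```

Because each group is processed sequentially and independently, the recurrence stays simple and only the grouping needs to be distributed.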
Answer 0 (score: 0)
@pansen, I solved the problem with the following code. It may be useful if you are trying to solve a similar problem.
def creditUsage(rows):
    '''
    Input:
        timestamp, activity, credit
        ['1;5;0', '2;0;3', '3;4;0', '4;0;3', '5;1;0', '6;1;0', '7;5;0', '8;0;1', '9;0;1', '10;5;0']
    Output:
        [timestamp; creditUsage]
    '''
    # Parse and sort by timestamp; collect_list does not guarantee row order.
    parsed = sorted((int(ts), float(act), float(cr))
                    for ts, act, cr in (r.split(";") for r in rows))
    timestamp, trActivity, credit = zip(*parsed)
    creditBalance, creditUsage = [0.0] * len(credit), [0.0] * len(credit)
    for i in range(len(trActivity)):
        # Carry the previous balance forward (0 for the first row) and add this row's credit.
        previous = creditBalance[i-1] if i > 0 else 0.0
        creditBalance[i] = previous + credit[i]
        # If the balance covers the activity, the usage is the activity; otherwise it is the whole balance.
        creditUsage[i] = creditBalance[i] if creditBalance[i] - trActivity[i] < 0 else trActivity[i]
        creditBalance[i] -= creditUsage[i]
    output = ["{0};{1:02}".format(t_, r_) for t_, r_ in zip(timestamp, creditUsage)]
    return output
from pyspark.sql.functions import udf, concat_ws, col, collect_list, explode, split
from pyspark.sql.types import ArrayType, StringType
realBonusUDF = udf(creditUsage, ArrayType(StringType()))
# Assumes df also has a userID column to group by.
a = df.withColumn('data', concat_ws(';', col('period'), col('activity'), col('credit'))) \
    .groupBy('userID').agg(collect_list('data').alias('data')) \
    .withColumn('data', realBonusUDF('data')) \
    .withColumn("data", explode("data")) \
    .withColumn("data", split("data", ";")) \
    .withColumn("timestamp", col('data')[0].cast("int")) \
    .withColumn("creditUsage", col('data')[1].cast("float")) \
    .drop('data')
Output:
+------+---------+-----------+
|userID|timestamp|creditUsage|
+------+---------+-----------+
|   123|        1|        0.0|
|   123|        2|        0.0|
|   123|        3|        3.0|
|   123|        4|        0.0|
|   123|        5|        1.0|
|   123|        6|        1.0|
|   123|        7|        1.0|
|   123|        8|        0.0|
|   123|        9|        0.0|
|   123|       10|        2.0|
+------+---------+-----------+
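The table above shows the credit used in each period, not the realActivity column the question asks for. Assuming realActivity is simply activity minus creditUsage (an interpretation, not something the answer states), the remaining step can be sketched in plain Python with the values from the tables; the variable names are illustrative only.

```python
# activity column from the question's input table, periods 1..10
activity = [5, 0, 4, 0, 1, 1, 5, 0, 0, 5]
# creditUsage column from the answer's output table, same order
credit_usage = [0.0, 0.0, 3.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 2.0]

# Assumption: activity not paid for by credit is the "real" activity.
real_activity_values = [int(a - u) for a, u in zip(activity, credit_usage)]
print(real_activity_values)
```

In Spark, the equivalent would presumably be to join the result back to the original rows on userID and timestamp and subtract the two columns.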