In PySpark, what is the best way to operate on each id when groupby is not applicable? Here is the example code:
from pyspark.sql.window import Window
from pyspark.sql.functions import lag

# Process each id separately: filter, sort by date, and compute the gap to the previous transaction
for id in [int(i.id) for i in df.select('id').distinct().collect()]:
    temp = df.where("id == {}".format(id))
    temp = temp.sort("date")
    my_window = Window.partitionBy().orderBy("id")
    temp = temp.withColumn("prev_transaction", lag(temp['date']).over(my_window))
    temp = temp.withColumn("diff", temp['date'] - temp["prev_transaction"])
    temp = temp.where('diff > 0')
    # select a row and so on
What is the best way to optimize this?
Answer 0 (score: 0)
I assume you have a transactions dataframe to which you want to add a column of previous transactions.
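For illustration, here is a minimal sketch of such a dataframe; the column names id and date are taken from the question, but the sample values are made up:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one row per transaction, with the entity id and the transaction date
df = spark.createDataFrame(
    [(1, "2020-01-01"), (1, "2020-01-05"), (2, "2020-01-02"), (2, "2020-01-03")],
    ["id", "date"],
)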
The code below adds an array column containing the previous transactions up to and including each transaction. Since the last (current) transaction is also added to the array, you can remove it with a simple custom UDF, as sketched after the window code:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Collect all transactions from the start of each id's partition up to the current row
windowSpec = Window.partitionBy('id').orderBy('date')
df = df.withColumn(
    'pre_transactions',
    F.collect_list('id').over(windowSpec.rangeBetween(Window.unboundedPreceding, 0))
)
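A minimal sketch of such a UDF; the helper name remove_last and the array element type (long, to match collect_list('id')) are assumptions, not part of the original answer:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# Hypothetical UDF: drop the last element of the array, i.e. the current transaction itself
remove_last = udf(lambda arr: arr[:-1] if arr else arr, ArrayType(LongType()))
df = df.withColumn('pre_transactions', remove_last('pre_transactions'))

This keeps the whole computation as one partitioned window plus a column transformation, so there is no need to loop over the distinct ids and collect() them to the driver.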