I have a complex piece of logic to implement. I have been trying for a while but still have no clue, so please help me check whether it is feasible and, if so, how to do it. Thank you very much!
I have the following Spark SQL DataFrame (datetime is increasing, the 'type' values repeat, and every segment (run of one type) always starts with 'flag' = 1):
+---------+-----+----+-----+
|datetime |type |flag|value|
+---------+-----+----+-----+
|20170901 |A |1 | 560|
|20170902 |A |0 | 3456|
|20170903 |A |0 | 50|
|20170904 |A |0 | 789|
......
|20170912 |B |1 | 345|
|20170913 |B |0 | 4510|
|20170915 |B |0 | 508|
......
|20170919 |C |1 | 45|
|20170923 |C |0 | 410|
|20170925 |C |0 | 108|
......
|20171001 |A |1 | 198|
|20171002 |A |0 | 600|
|20171005 |A |0 | 675|
|20171008 |A |0 | 987|
......
I need to create a computed column, based on the previous row and the current row, to get a DataFrame like the one below (the computed field Seq is an increasing segment sequence number):
+---------+-----+----+-----+-----+
|datetime |type |flag|value| Seq|
+---------+-----+----+-----+-----+
|20170901 |A |1 | 560| 1|
|20170902 |A |0 | 3456| 1|
|20170903 |A |0 | 50| 1|
|20170904 |A |0 | 789| 1|
......
|20170912 |B |1 | 345| 2|
|20170913 |B |0 | 4510| 2|
|20170915 |B |0 | 508| 2|
......
|20170919 |C |1 | 45| 3|
|20170923 |C |0 | 410| 3|
|20170925 |C |0 | 108| 3|
......
|20171001 |A |1 | 198| 4|
|20171002 |A |0 | 600| 4|
|20171005 |A |0 | 675| 4|
|20171008 |A |0 | 987| 4|
......
Any clue is appreciated. I wrote the following code (thanks to https://stackoverflow.com/users/1592191/mrsrinivas):
from pyspark.sql import SQLContext, Row
from pyspark import SparkConf, SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func
import sys
conf = SparkConf().setMaster("local[2]")
conf = conf.setAppName("test")
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
# sample data
rdd = sc.parallelize([(20170901,"A",1,560), (20170902,"A",0,3560), (20170903,"A",0,50), (20170904,"A",0,56),
                      (20170912,"B",1,345), (20170913,"B",0,4510), (20170915,"B",0,453),
                      (20170919,"C",1,55), (20170923,"C",0,410), (20170925,"C",0,108),
                      (20171001,"A",1,189), (20171002,"A",0,600), (20171005,"A",0,650), (20171008,"A",0,956)])
df = spark.createDataFrame(rdd, ["datetime", "type", "flag", "value"])
df.show()
# rank rows within each 'type', ordered by flag descending, over an unbounded range frame
windowSpec = Window.partitionBy(df['type']).orderBy(df['flag'].desc()).rangeBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('Seq', func.dense_rank().over(windowSpec))
df.show()
But I hit an error: Py4JJavaError: An error occurred while calling o514.withColumn. : org.apache.spark.sql.AnalysisException: Window Frame RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW; Any ideas?
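From what I can tell, the exception comes from the frame specification itself: ranking functions such as dense_rank only accept the default running row frame, so the explicit rangeBetween is rejected. A minimal sketch (same df as above) that at least avoids the exception, even though it still does not produce the segment Seq I want:
# Sketch only: dropping the explicit frame lets dense_rank run,
# but ranking by flag within each 'type' is not the segment number I need
windowSpec = Window.partitionBy(df['type']).orderBy(df['flag'].desc())
df.withColumn('Seq', func.dense_rank().over(windowSpec)).show()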
Answer 0 (score: 0)
Hope this helps!
from pyspark.sql.window import Window
from pyspark.sql.functions import col, monotonically_increasing_id, when, last
import sys
#sample data
df = sc.parallelize([(20170901,"A",1,560), (20170902,"A",0,3560), (20170903,"A",0,50), (20170904,"A",0,56),
(20170912,"B",1,345), (20170913,"B",0,4510), (20170915,"B",0,453),
(20170919,"C",1,55), (20170923,"C",0,410), (20170925,"C",0,108),
(20171001,"A",1,189), (20171002,"A",0,600), (20171005,"A",0,650), (20171008,"A",0,956)]).\
toDF(["datetime", "type", "flag", "value"])
df = df.withColumn("row_id",monotonically_increasing_id())
w = Window.partitionBy(col("type")).orderBy(col('datetime'))
df1 = df.withColumn("seq_temp", when(col('flag')==1, col('row_id')).otherwise(None))
df1 = df1.withColumn("seq", last('seq_temp', True).over(w.rowsBetween(-sys.maxsize, 0))).\
drop('row_id','seq_temp').\
sort('Seq')
df1.show()
+--------+----+----+-----+----------+
|datetime|type|flag|value| seq|
+--------+----+----+-----+----------+
|20170901| A| 1| 560| 0|
|20170902| A| 0| 3560| 0|
|20170903| A| 0| 50| 0|
|20170904| A| 0| 56| 0|
|20170913| B| 0| 4510| 4|
|20170912| B| 1| 345| 4|
|20170915| B| 0| 453| 4|
|20170919| C| 1| 55|8589934592|
|20170925| C| 0| 108|8589934592|
|20170923| C| 0| 410|8589934592|
|20171001| A| 1| 189|8589934595|
|20171008| A| 0| 956|8589934595|
|20171005| A| 0| 650|8589934595|
|20171002| A| 0| 600|8589934595|
+--------+----+----+-----+----------+
The seq values are not a perfect 1, 2, 3, ... sequence, but they are monotonically increasing.
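If strictly consecutive segment numbers are needed, one option (just a sketch, reusing df1 from above) is to densify seq with dense_rank over a single global window:
# Sketch: map the monotonically increasing seq values onto consecutive 1, 2, 3, ...
# (no partitionBy, so Spark will warn about moving all rows into one partition)
from pyspark.sql.functions import dense_rank
w_all = Window.orderBy('seq')
df2 = df1.withColumn('Seq', dense_rank().over(w_all)).drop('seq')
df2.show()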
Answer 1 (score: 0)
You can use the code below; I have modified the sample so that 'A' appears twice:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, monotonically_increasing_id, when, last
from pyspark.sql.functions import lit
import sys
import pyspark.sql.functions as func
#sample data
df = sc.parallelize([(20170901,"A",1,560), (20170902,"A",0,3560), (20170903,"A",0,50), (20170904,"A",0,56),
(20170912,"B",1,345), (20170913,"B",0,4510), (20170915,"B",0,453),
(20170919,"C",1,55), (20170923,"C",0,410), (20170925,"C",0,108),
(20171001,"A",1,189), (20171002,"A",0,600), (20171005,"A",0,650), (20171008,"A",0,956)]).\
toDF(["datetime", "type", "flag", "value"])
df = df.withColumn("row_id",monotonically_increasing_id())
w = Window.partitionBy(col("type")).orderBy(col('datetime'))
df1 = df.withColumn("seq_temp", when(col('flag')==1, col('row_id')).otherwise(None))
df1 = df1.withColumn("seq", last('seq_temp', True).over(w.rowsBetween(-sys.maxsize, 0))).sort('Seq')
r = df1.withColumn('seq_item',lit(0))
windowSpec = Window.partitionBy(r['seq_item']).orderBy(r['seq'])
s = r.withColumn('seq_1',func.dense_rank().over(windowSpec)).drop('seq_temp','seq','seq_item','row_id')
s.show()
+--------+----+----+-----+-----+
|datetime|type|flag|value|seq_1|
+--------+----+----+-----+-----+
|20170901|   A|   1|  560|    1|
|20170902|   A|   0| 3560|    1|
|20170903|   A|   0|   50|    1|
|20170904|   A|   0|   56|    1|
|20170912|   B|   1|  345|    2|
|20170913|   B|   0| 4510|    2|
|20170915|   B|   0|  453|    2|
|20170919|   C|   1|   55|    3|
|20170923|   C|   0|  410|    3|
|20170925|   C|   0|  108|    3|
|20171001|   A|   1|  189|    4|
|20171002|   A|   0|  600|    4|
|20171005|   A|   0|  650|    4|
|20171008|   A|   0|  956|    4|
+--------+----+----+-----+-----+
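For what it's worth, since every segment starts with flag = 1, a shorter route (a sketch only, not taken from either answer) is a running sum of flag over the global datetime order, which yields 1, 2, 3, ... directly at the cost of a single-partition window:
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Sketch: every segment begins with flag = 1, so a running sum of flag
# over the global datetime order numbers the segments 1, 2, 3, ...
# (no partitionBy, so Spark warns about moving all rows into one partition)
w_seq = Window.orderBy('datetime').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df_with_seq = df.withColumn('Seq', func.sum('flag').over(w_seq))
df_with_seq.show()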