我有一个客户的Spark DataFrame,如下所示。
#SparkR code
customers <- data.frame(custID = c("001", "001", "001", "002", "002", "002", "002"),
date = c("2017-02-01", "2017-03-01", "2017-04-01", "2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
value = c('new', 'good', 'good', 'new', 'good', 'new', 'bad'))
customers <- createDataFrame(customers)
display(customers)
custID| date | value
--------------------------
001 | 2017-02-01| new
001 | 2017-03-01| good
001 | 2017-04-01| good
002 | 2017-01-01| new
002 | 2017-02-01| good
002 | 2017-03-01| new
002 | 2017-04-01| bad
在custID
的第一个月观察中,客户获得了value
新的&#39;。此后,他们被归类为“好”&#39;或者“坏”#39;但是,客户可以从“好”中恢复。或者“坏”&#39;回到&#39; new&#39;如果他们开了第二个账户。当发生这种情况时,我想用&#39; 2&#39;标记客户。而不是&#39; 1&#39;,表示他们开了第二个帐户,如下所示。我怎么能在Spark中做到这一点? SparkR或PySpark命令都可以工作。
#What I want to get
custID| date | value | tag
--------------------------------
001 | 2017-02-01| new | 1
001 | 2017-03-01| good | 1
001 | 2017-04-01| good | 1
002 | 2017-01-01| new | 1
002 | 2017-02-01| good | 1
002 | 2017-03-01| new | 2
002 | 2017-04-01| bad | 2
答案 0 :(得分:0)
在pyspark:
from pyspark.sql import functions as f
spark = SparkSession.builder.getOrCreate()
# df is equal to your customers dataframe
df = spark.read.load('file:///home/zht/PycharmProjects/test/text_file.txt', format='csv', header=True, sep='|').cache()
df_new = df.filter(df['value'] == 'new').withColumn('tag', f.rank().over(Window.partitionBy('custID').orderBy('date')))
df = df_new.union(df.filter(df['value'] != 'new').withColumn('tag', f.lit(None)))
df = df.withColumn('tag', f.collect_list('tag').over(Window.partitionBy('custID').orderBy('date'))) \
.withColumn('tag', f.UserDefinedFunction(lambda x: x.pop(), IntegerType())('tag'))
df.show()
输出:
+------+----------+-----+---+
|custID| date|value|tag|
+------+----------+-----+---+
| 001|2017-02-01| new| 1|
| 001|2017-03-01| good| 1|
| 001|2017-04-01| good| 1|
| 002|2017-01-01| new| 1|
| 002|2017-02-01| good| 1|
| 002|2017-03-01| new| 2|
| 002|2017-04-01| bad| 2|
+------+----------+-----+---+
顺便说一下, pandas 可以做到这一点。
答案 1 :(得分:0)
这可以使用以下代码完成:
使用&#34; new&#34;
过滤掉所有记录df_new<-sql("select * from df where value="new")
createOrReplaceTempView(df_new,"df_new")
df_new<-sql("select *,row_number() over(partiting by custID order by date)
tag from df_new")
createOrReplaceTempView(df_new,"df_new")
df<-sql("select custID,date,value,min(tag) as tag from
(select t1.*,t2.tag from df t1 left outer join df_new t2 on
t1.custID=t2.custID and t1.date>=t2.date) group by 1,2,3")