Pyspark - select users who appear on at least 2 consecutive days

Date: 2019-02-09 13:36:24

Tags: apache-spark pyspark apache-spark-sql

I have a dataframe dataframe_actions with the fields user_id, action and day. user_id is unique for each user, and day takes values from 1 to 31. I want to keep only the users who are seen on at least 2 consecutive days, for example:

If a user is seen on days 1, 2, 4, 8 and 9, I want to keep them, because they appeared on at least 2 consecutive days.

What I am doing right now is clumsy and slow (and it does not seem to work):

df_final = spark.sql(""" with t1( select user_id, day, row_number()
           over(partition by user_id order by day)-day diff from dataframe_actions), 
           t2( select user_id, day, collect_set(diff) over(partition by user_id) diff2 from t1) 
           select user_id, day from t2 where size(diff2) > 2""")

Something along those lines, but I can't figure out how to make it work.

Edit:

| user_id | action | day |
--------------------------
| asdc24  | conn   |  1  |
| asdc24  | conn   |  2  |
| asdc24  | conn   |  5  |
| adsfa6  | conn   |  1  |
| adsfa6  | conn   |  3  |
| asdc24  | conn   |  9  |
| adsfa6  | conn   |  5  |
| asdc24  | conn   |  11 |
| adsfa6  | conn   |  10 |
| asdc24  | conn   |  15 |

should return

| user_id | action | day |
--------------------------
| asdc24  | conn   |  1  |
| asdc24  | conn   |  2  |
| asdc24  | conn   |  5  |
| asdc24  | conn   |  9  |
| asdc24  | conn   |  11 |
| asdc24  | conn   |  15 |

since only this user connected on at least two consecutive days (days 1 and 2).

2 answers:

Answer 0: (score: 2)

Another SQL approach using the given input.

Pyspark

>>> from pyspark.sql.functions import *
>>> df = sc.parallelize([("asdc24","conn",1),
... ("asdc24","conn",2),
... ("asdc24","conn",5),
... ("adsfa6","conn",1),
... ("adsfa6","conn",3),
... ("asdc24","conn",9),
... ("adsfa6","conn",5),
... ("asdc24","conn",11),
... ("adsfa6","conn",10),
... ("asdc24","conn",15)]).toDF(["user_id","action","day"])
>>> df.createOrReplaceTempView("qubix")
>>> spark.sql(" select * from qubix order by user_id, day").show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| adsfa6|  conn|  1|
| adsfa6|  conn|  3|
| adsfa6|  conn|  5|
| adsfa6|  conn| 10|
| asdc24|  conn|  1|
| asdc24|  conn|  2|
| asdc24|  conn|  5|
| asdc24|  conn|  9|
| asdc24|  conn| 11|
| asdc24|  conn| 15|
+-------+------+---+

>>> spark.sql(""" with t1 (select user_id,action, day,lead(day) over(partition by user_id order by day) ld from qubix), t2 (select user_id from t1 where ld-t1.day=1 ) select * from qubix where user_id in (select user_id from t2) """).show()
+-------+------+---+
|user_id|action|day|
+-------+------+---+
| asdc24|  conn|  1|
| asdc24|  conn|  2|
| asdc24|  conn|  5|
| asdc24|  conn|  9|
| asdc24|  conn| 11|
| asdc24|  conn| 15|
+-------+------+---+

>>>
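
For reference, the same lead-based check can also be written with the DataFrame API. This is only a sketch, not part of the original answer, and it assumes the df built in the transcript above:

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('user_id').orderBy('day')

# rows whose next recorded day for the same user is exactly one day later
with_next = df.withColumn('next_day', f.lead('day').over(w))
consecutive = (with_next.filter(f.col('next_day') - f.col('day') == 1)
                        .select('user_id').distinct())

# mirror the "user_id in (select user_id from t2)" subquery with a join
df.join(consecutive, 'user_id').orderBy('user_id', 'day').show()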

Answer 1: (score: 1)

Use lag to get the previous day for each user, subtract it from the day in the current row, and check whether at least one of the differences is 1. This is done with a groupBy and a filter.

from pyspark.sql import functions as f
from pyspark.sql import Window
w = Window.partitionBy(dataframe_actions.user_id).orderBy(dataframe_actions.day)
user_prev = dataframe_actions.withColumn('prev_day_diff', dataframe_actions.day - f.lag(dataframe_actions.day).over(w))
res = user_prev.groupBy(user_prev.user_id).agg(f.sum(f.when(user_prev.prev_day_diff == 1, 1).otherwise(0)).alias('diff_1'))
res.filter(res.diff_1 >= 1).show()

Another approach uses a row_number difference; that makes it possible to select all columns for a given user_id (see the sketch below).
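
The code for that row_number variant is not included in the answer; below is a minimal sketch of the idea (day minus row_number stays constant within a run of consecutive days), assuming the dataframe_actions frame from the question:

from pyspark.sql import functions as f
from pyspark.sql import Window

w = Window.partitionBy('user_id').orderBy('day')

# day - row_number is constant within a run of consecutive days
runs = dataframe_actions.withColumn('grp', f.col('day') - f.row_number().over(w))

# user_ids that have at least one run of length >= 2
consecutive_users = (runs.groupBy('user_id', 'grp').count()
                         .filter(f.col('count') >= 2)
                         .select('user_id').distinct())

# keep every column and row for those users
dataframe_actions.join(consecutive_users, 'user_id').orderBy('user_id', 'day').show()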