Joining another table to a group-sorted table and using the first() function

Time: 2019-05-30 03:53:22

Tags: sql apache-spark pyspark

First, I use a window function to sort the table by charge_time in t1, then I join t1 and t2 on user_id. When t1 has multiple records for a user, I want to get the first one, and I use the first() function to do that.

    _df = ss.sql("""
                    SELECT 
                        t1.user_id,
                        t1.pay_id,
                        t1.sku_mode,
                        t1.charge_time,
                        t1.exchange_type_t01,
                        ROW_NUMBER() OVER(PARTITION BY t1.user_id ORDER BY t1.charge_time)
                    FROM 
                        {} t1 
                    WHERE 
                        t1.refund_state = 0
                """.format(exchange_info_table))
    _df.createOrReplaceTempView('d_exchange_info')

    df = ss.sql("""
            SELECT 
                first(t1.sku_mode) AS sku_mode,
                first(t1.exchange_type_t01) AS exchange_type_t01,
                first(t1.user_id) AS user_id,
                first(t1.pay_id) AS pay_id,
                first(t1.charge_time) AS charge_time,
                first(t2.has_yxs_payment) AS has_yxs_payment,
                first(t2.has_sxy_payment) AS has_sxy_payment,
                first(t2.has_cxy_payment) AS has_cxy_payment,
                first(t2.has_sxy19_payment) AS has_sxy19_payment,
                first(t2.sxy19_join_time) AS sxy19_join_time,
                first(t2.yxs_join_time) AS yxs_join_time
            FROM
                d_exchange_info t1
            JOIN
                analytics_db.md_day_dump_users t2
            ON 
                t2.the_day = '{}'
                AND t1.user_id = t2.user_id
            GROUP BY
                t1.user_id
    """.format(st))

I use the first() function, but the records sorted by charge_time come back unstable: when there are multiple records, sometimes I get one row and sometimes another.

Why does this happen, and how do I fix it? Is this a Spark SQL issue, or is there a problem with my SQL?

PS: I already know how to fix it another way, but I want to know why the first() function does not work here.

Thanks!

1 Answer:

Answer 0 (score: 1)

I don't know Spark very well, but from the documentation:

The function is non-deterministic because its results depends on order of rows 
which may be non-deterministic after a shuffle.

Your window function is producing a row_number, but you are not using it anywhere.

You need to either sort the result set or use the generated row number and add where row_number = 1. You will also have to give the row_number column a name, unless Spark does that implicitly.
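A minimal sketch of that suggestion, reusing the table and column names from the question (the alias rn is a hypothetical name introduced here): alias the window column, filter to rn = 1 so each user keeps only the row with the earliest charge_time, and then the first() aggregates and the GROUP BY are no longer needed. This is untested against the original tables:

    _df = ss.sql("""
                    SELECT
                        t1.user_id,
                        t1.pay_id,
                        t1.sku_mode,
                        t1.charge_time,
                        t1.exchange_type_t01,
                        -- name the window column so it can be filtered on later
                        ROW_NUMBER() OVER(PARTITION BY t1.user_id ORDER BY t1.charge_time) AS rn
                    FROM
                        {} t1
                    WHERE
                        t1.refund_state = 0
                """.format(exchange_info_table))
    _df.createOrReplaceTempView('d_exchange_info')

    df = ss.sql("""
            SELECT
                t1.sku_mode,
                t1.exchange_type_t01,
                t1.user_id,
                t1.pay_id,
                t1.charge_time,
                t2.has_yxs_payment,
                t2.has_sxy_payment,
                t2.has_cxy_payment,
                t2.has_sxy19_payment,
                t2.sxy19_join_time,
                t2.yxs_join_time
            FROM
                d_exchange_info t1
            JOIN
                analytics_db.md_day_dump_users t2
            ON
                t2.the_day = '{}'
                AND t1.user_id = t2.user_id
            WHERE
                t1.rn = 1  -- keep only the earliest charge_time row per user
    """.format(st))

With the deterministic rn = 1 filter, each user contributes exactly one row to the join, so the result no longer depends on the non-deterministic row order that first() sees after a shuffle.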