Joining two tables on a timestamp in Spark SQL

Date: 2020-07-09 04:06:24

Tags: sql apache-spark pyspark apache-spark-sql

I am trying to join two tables with the following query:

results = sqlContext.sql('SELECT * \
                          FROM airlines a \
                          INNER JOIN LATERAL ( \
                            SELECT * \
                            FROM weather w \
                            WHERE w.CALL_SIGN = a.ORIGIN  \
                              AND w.WEATHER_TIMESTAMP BETWEEN a.CRS_DEP_TIME - INTERVAL 15 MINUTES AND a.CRS_DEP_TIME + INTERVAL 15 MINUTES \
                            ORDER BY w.WEATHER_TIMESTAMP DESC \
                            LIMIT 1 ) \
                           ON a.ORIGIN = w.CALL_SIGN').cache()

I am running into the problem that the airlines table cannot be referenced inside the inner join. I tried adding the LATERAL keyword, hoping Spark SQL supports it the way Postgres does, but to no avail. I am not sure how to fix this query; any suggestions?

1 answer:

Answer 0: (score: 0)

Try this:

'SELECT * \
                          FROM airlines a \
                          INNER JOIN ( \
                            SELECT w.* \
                            FROM weather w \
                            INNER JOIN airlines a2 \
                              ON w.CALL_SIGN = a2.ORIGIN \
                              AND w.WEATHER_TIMESTAMP BETWEEN a2.CRS_DEP_TIME - INTERVAL 15 MINUTES AND a2.CRS_DEP_TIME + INTERVAL 15 MINUTES \
                            ORDER BY w.WEATHER_TIMESTAMP DESC \
                            LIMIT 1 ) x \
                           ON a.ORIGIN = x.CALL_SIGN'