Question

I am working in PySpark and have a table containing sales data for specific articles, with one row per date and article:

#ARTICLES
+-----------+----------+
|timestamp  |article_id|
+-----------+----------+
| 2018-01-02|   1111111|
| 2018-01-02|   2222222|
| 2018-01-02|   3333333|
| 2018-01-03|   1111111|
| 2018-01-03|   2222222|
| 2018-01-03|   3333333|
+-----------+----------+

Then I have a smaller table containing price data for each article. Each price is valid from one date to another, as given in the last two columns:

#PRICES
+----------+-----+----------+----------+
|article_id|price|from_date |to_date   |
+----------+-----+----------+----------+
|   1111111| 8.99|2000-01-01|2999-12-31|
|   2222222| 4.29|2000-01-01|2006-09-05|
|   2222222| 2.29|2006-09-06|2999-12-31|
+----------+-----+----------+----------+

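In the last two rows here, you can see that the price was reduced on 2006-09-06.

I now want to add the price to the first table. It must be the price that was valid at the respective timestamp. In this example, I want to get the following result: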

#RESULT
+-----------+----------+-----+
|timestamp  |article_id|price|
+-----------+----------+-----+
| 2018-01-02|   1111111| 8.99|
| 2018-01-02|   2222222| 2.29|
| 2018-01-02|   3333333| null|
| 2018-01-03|   1111111| 8.99|
| 2018-01-03|   2222222| 2.29|
| 2018-01-03|   3333333| null|
+-----------+----------+-----+

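What is the best way to do this?

One idea I had was to "roll out" the price table so that it contains one row per timestamp and article_id, and then join on those two keys. But I don't know how to roll out a table using two date columns.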

Answer 1

A join condition using between should work.

from pyspark.sql.functions import col
articles.alias('articles').join(prices.alias('prices'), 
   on=(
        (col('articles.article_id') == col('prices.article_id')) & 
        (col('articles.timestamp').between(col('prices.from_date'), col('prices.to_date')))
   ),
   how='left'
).select('articles.*','prices.price')

Answer 2

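Another option is to do a left join on article_id and then use DataFrame.where() to keep only the price that is valid for each timestamp. Note that an article that has price rows, none of which covers the timestamp, is dropped by this filter instead of being kept with a null price.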

import pyspark.sql.functions as f
articles.alias("a").join(prices.alias("p"), on="article_id", how="left")\
    .where(
        f.col("p.article_id").isNull() |  # without this, it becomes an inner join
        f.col("timestamp").between(
            f.col("from_date"),
            f.col("to_date")
        )
    )\
    .select(
        "timestamp",
        "article_id",
        "price"
    )\
    .show()
#+----------+----------+-----+
#| timestamp|article_id|price|
#+----------+----------+-----+
#|2018-01-02|   1111111| 8.99|
#|2018-01-02|   2222222| 2.29|
#|2018-01-02|   3333333| null|
#|2018-01-03|   1111111| 8.99|
#|2018-01-03|   2222222| 2.29|
#|2018-01-03|   3333333| null|
#+----------+----------+-----+

Answer 3

Here is another way to achieve the desired result:

from pyspark.sql import functions as f
# >= and <= keep from_date and to_date themselves inside the validity range
result = articles.alias('articles').join(
    prices.alias('prices'),
    (f.col('articles.article_id') == f.col('prices.article_id'))
    & (f.col('articles.timestamp') >= f.col('prices.from_date'))
    & (f.col('articles.timestamp') <= f.col('prices.to_date')),
    'left'
).select('articles.*', 'prices.price')

result should be

+----------+----------+-----+
|timestamp |article_id|price|
+----------+----------+-----+
|2018-01-02|2222222   |2.29 |
|2018-01-03|2222222   |2.29 |
|2018-01-02|3333333   |null |
|2018-01-03|3333333   |null |
|2018-01-02|1111111   |8.99 |
|2018-01-03|1111111   |8.99 |
+----------+----------+-----+

How do I join a table with 'valid_from' and 'valid_to' columns to a table with a timestamp?

3 answers: