I have a SQL Server query that I am trying to convert to run in BigQuery. Three tables are involved:
CalendarMonths
FirstDayOfMonth | FirstDayOfNextMonth
----------------------------+----------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC
Clients
clientid | name | etc.
---------+----------------+------
1 | Bob's Shop |
2 | Anne's Cookies |
ClientLogs
id | clientid | timestamp | price_current | price_old | license_count_current | license_count_old |
----+----------+----------------+---------------+-----------+-----------------------+---------------
1 | 1 | 2017-02-01 UTC | 1200 | 0 | 10 | 0 |
2 | 1 | 2018-02-03 UTC | 2400 | 1200 | 20 | 10 |
3 | 2 | 2016-07-13 UTC | 1200 | 0 | 10 | 0 |
4 | 2 | 2018-03-30 UTC | 0 | 1200 | 0 | 10 |
The T-SQL query looks like this:
SELECT
    FirstDayOfMonth, FirstDayOfNextMonth,
    (SELECT SUM(sizeatdatelog.price_current)
     FROM clients c
     CROSS APPLY (SELECT TOP 1 *
                  FROM clientlogs
                  WHERE clientid = c.clientid
                    AND [timestamp] < cm.FirstDayOfMonth
                  ORDER BY [timestamp] DESC) sizeatdatelog
     WHERE sizeatdatelog.license_count_current > 0) as StartingRevenue,
    (another subquery for starting client count) as StartingClientCount,
    (another subquery for churned revenue) as ChurnedRevenue,
    (there are about 6 other subqueries)
FROM
    CalendarMonths cm
ORDER BY
    cm.FirstDayOfMonth
The final output looks like this:
FirstDayOfMonth | FirstDayOfNextMonth | StartingRevenue | StartingClientCount | etc
-------------------------------------------------------------------------------------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC | 68382995.43 | 79430 |
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC | 69843625.12 | 80430 |
In BigQuery, I added a simple subquery to the SELECT clause, and it runs fine:
SELECT FirstDayOfMonth, FirstDayOfNextMonth, (SELECT clientId FROM clientlogs LIMIT 1 ) as cl
FROM CalendarMonths cm
ORDER BY cm.FirstDayOfMonth
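However, as soon as I add a WHERE clause to the subquery, I get the following error message:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
Where should I go from here? If I can't get the result I'm looking for in a single query, maybe I should consider creating several scheduled jobs that build temporary tables, followed by a final scheduled job that joins them together. Or maybe I could do this in code on GCP, or use the BigQuery API from Apps Script. The data is not large and the query does not run often. I care more about maintainability than efficiency, so ideally there is a way to get this data in a single query.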
Answer 0 (score: 3)
Below is for BigQuery Standard SQL
#standardSQL
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
  SUM(price_current) StartingRevenue, COUNT(1) StartingClientCount
FROM (
  SELECT FirstDayOfMonth, FirstDayOfNextMonth,
    clientid, price_current
  FROM (
    SELECT FirstDayOfMonth, FirstDayOfNextMonth, clientid,
      FIRST_VALUE(price_current) OVER(latest_values) price_current,
      FIRST_VALUE(license_count_current) OVER(latest_values) license_count_current
    FROM `project.dataset.CalendarMonths` cm
    JOIN `project.dataset.ClientLogs` cl
    ON `timestamp` < FirstDayOfMonth
    -- Partition by month as well as client, so that each month picks up
    -- the client's latest log before that month rather than before the last month
    WINDOW latest_values AS (PARTITION BY FirstDayOfMonth, clientid ORDER BY `timestamp` DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  )
  WHERE license_count_current > 0
  GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, clientid, price_current
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth
The above can most likely be extended to the other subqueries.
Answer 1 (score: 0)
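Correlated subqueries such as
(SELECT TOP 1 * FROM clientlogs WHERE clientid = c.clientid AND [timestamp] < cm.FirstDayOfMonth ORDER BY [timestamp] DESC)
usually need to be rewritten in BigQuery using an aggregation:
SELECT ARRAY_AGG(foo ORDER BY `timestamp` DESC LIMIT 1)[OFFSET(0)] FROM ... AS foo WHERE {correlation condition}
BigQuery is more likely to accept simple correlated subqueries of the form
SELECT {optional aggregation} FROM table WHERE {correlation condition}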
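To make the ARRAY_AGG pattern concrete, here is a minimal sketch that avoids the correlated subquery entirely by joining and grouping. It is an illustration under assumptions, not a tested drop-in replacement: the sample rows are copied from the question and inlined as CTEs so no tables are needed, and the CTE name latest_log and the alias latest are made up for this example.

```sql
#standardSQL
-- Sample rows copied from the question, inlined so the query is self-contained.
WITH CalendarMonths AS (
  SELECT TIMESTAMP '2017-02-01 00:00:00 UTC' AS FirstDayOfMonth,
         TIMESTAMP '2017-03-01 00:00:00 UTC' AS FirstDayOfNextMonth UNION ALL
  SELECT TIMESTAMP '2017-03-01 00:00:00 UTC', TIMESTAMP '2017-04-01 00:00:00 UTC'
),
ClientLogs AS (
  SELECT 1 AS id, 1 AS clientid, TIMESTAMP '2017-02-01 00:00:00 UTC' AS `timestamp`,
         1200 AS price_current, 0 AS price_old,
         10 AS license_count_current, 0 AS license_count_old UNION ALL
  SELECT 2, 1, TIMESTAMP '2018-02-03 00:00:00 UTC', 2400, 1200, 20, 10 UNION ALL
  SELECT 3, 2, TIMESTAMP '2016-07-13 00:00:00 UTC', 1200, 0, 10, 0 UNION ALL
  SELECT 4, 2, TIMESTAMP '2018-03-30 00:00:00 UTC', 0, 1200, 0, 10
),
-- Hypothetical helper CTE: one row per (month, client) holding the client's
-- most recent log row before the month starts, kept as a STRUCT.
latest_log AS (
  SELECT cm.FirstDayOfMonth, cm.FirstDayOfNextMonth, cl.clientid,
         ARRAY_AGG(cl ORDER BY cl.`timestamp` DESC LIMIT 1)[OFFSET(0)] AS latest
  FROM CalendarMonths cm
  JOIN ClientLogs cl
    ON cl.`timestamp` < cm.FirstDayOfMonth
  GROUP BY cm.FirstDayOfMonth, cm.FirstDayOfNextMonth, cl.clientid
)
-- Keep only clients that still hold licenses, then total per month.
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
       SUM(latest.price_current) AS StartingRevenue,
       COUNT(*) AS StartingClientCount
FROM latest_log
WHERE latest.license_count_current > 0
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth
```

With the four sample log rows this should give StartingRevenue 1200 and StartingClientCount 1 for February 2017 (only client 2 has a log strictly before 2017-02-01), and 2400 and 2 for March 2017. Note that the inner JOIN drops any month with no earlier logs, whereas the T-SQL version would return that month with NULL; a LEFT JOIN back to CalendarMonths would be one way to keep such months.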
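Answer 2 (score: 0)
For the benefit of the community, I am posting the query I ended up using. Many thanks to Mikhail Berlyant for his help.
I ended up breaking the query into CTEs so that I could use correlated subqueries to get the specific data I needed.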