After some manipulation, I ended up with a table in BigQuery (GBQ) that lists every transaction made on a blockchain (about 280 million rows):
+-----+-------------------------+--------+-------+----------+
| Row | timestamp               | sender | value | receiver |
+-----+-------------------------+--------+-------+----------+
| 1   | 2018-06-28 01:31:00 UTC | User1  | 1.67  | User2    |
| 2   | 2017-04-06 00:47:29 UTC | User3  | 0.02  | User4    |
| 3   | 2013-11-27 13:22:05 UTC | User5  | 0.25  | User6    |
+-----+-------------------------+--------+-------+----------+
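Because this table contains every transaction, summing all the values for each user up to a given date should give me that user's balance. Since there are almost 22 million users, I would like to bin them by the number of coins they hold. I used the following query to go over the whole dataset: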
#standardSQL
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      MAX(timestamp),
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin
which gives me something like this:
+-----+-------+---------------+
| Row | num   | bin           |
+-----+-------+---------------+
| 1   | 11518 | 1_to_10       |
| 2   | 9503  | 100_to_1000   |
| 3   | 18070 | 10_to_100     |
| 4   | 20275 | 0_to_1        |
| 5   | 1781  | 1000_to_10000 |
| 6   | 158   | More_10000    |
+-----+-------+---------------+
Now I want to go through the rows of the transactions table and, at the end of each day, check how many users are in each bin. The final table should look like this:
+-------------------------+---------+-----------+-----------+-------------+---------------+-------------+
| timestamp               | 0_to_1  | 1_to_10   | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000  |
+-------------------------+---------+-----------+-----------+-------------+---------------+-------------+
| 2009-01-09 00:00:00 UTC | 1       | 1         | 0         | 0           | 0             | 0           |
| 2009-01-10 00:00:00 UTC | 0       | 2         | 0         | 0           | 0             | 0           |
| ...                     | ...     | ...       | ...       | ...         | ...           | ...         |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315  | 234523555   | 2352355556    | 12124235231 |
+-------------------------+---------+-----------+-----------+-------------+---------------+-------------+
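Because the dataset is so large, I have not been able to order it by timestamp to make life easier, so I would appreciate any ideas. I wonder whether there is some way, for example pagination, to improve performance and save resources. I have heard of it, but I do not know how to use it.
Thanks!
UPDATE: after some work, I now do have a transactions table ordered by timestamp.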
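Answer 0 (score: 1)
The query below should give you the number of distinct receivers in each bin, grouped by timestamp. Keep in mind that it evaluates the value at the row level: it bins the value of each individual transaction, not a user's accumulated balance.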
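SELECT
  -- Truncate to midnight UTC so that there is one output row per day.
  TIMESTAMP_TRUNC(timestamp, DAY) AS day,
  COUNT(DISTINCT(CASE
    WHEN value > 0 AND value <= 1 THEN receiver
  END)) AS _0_to_1,
  COUNT(DISTINCT(CASE
    WHEN value > 1 AND value <= 10 THEN receiver
  END)) AS _1_to_10,
  COUNT(DISTINCT(CASE
    WHEN value > 10 AND value <= 100 THEN receiver
  END)) AS _10_to_100,
  COUNT(DISTINCT(CASE
    WHEN value > 100 AND value <= 1000 THEN receiver
  END)) AS _100_to_1000,
  COUNT(DISTINCT(CASE
    WHEN value > 1000 AND value <= 10000 THEN receiver
  END)) AS _1000_to_10000,
  COUNT(DISTINCT(CASE
    WHEN value > 10000 THEN receiver
  END)) AS More_10000
FROM `table.transactions`
-- Keep only the previous day (UTC): from yesterday's midnight up to,
-- but not including, today's midnight. An equality test against
-- CURRENT_TIMESTAMP() minus one day would match a single instant only.
WHERE timestamp >= TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 1 DAY)
  AND timestamp < TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
GROUP BY 1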
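Since the query above bins individual transaction values, here is a rough sketch of how the per-day counts by accumulated balance (what the question asks for) might be computed instead. It is not tested against the real table and rests on several assumptions: a balance is taken to be everything received minus everything sent, newly created coins are assumed to have a NULL sender, and the names ledger, daily_net, binned, moves, changes, per_day, tx_date and account are made up for illustration. Rather than recomputing every balance for every day, it records when an account moves from one bin to another and keeps a running total of those moves.

```sql
#standardSQL
WITH ledger AS (
  -- Every transaction credits the receiver...
  SELECT DATE(timestamp) AS tx_date, receiver AS account, value AS amount
  FROM `table.transactions`
  UNION ALL
  -- ...and debits the sender (assumption: sender is NULL for new coins).
  SELECT DATE(timestamp) AS tx_date, sender AS account, -value AS amount
  FROM `table.transactions`
  WHERE sender IS NOT NULL
),
daily_net AS (
  -- Net change per account and day.
  SELECT tx_date, account, SUM(amount) AS net
  FROM ledger
  GROUP BY tx_date, account
),
binned AS (
  -- Running balance at the end of each active day, mapped to a bin.
  -- Balances <= 0 fall through the CASE and get a NULL bin.
  SELECT
    tx_date,
    account,
    CASE
      WHEN balance > 0 AND balance <= 1 THEN '0_to_1'
      WHEN balance > 1 AND balance <= 10 THEN '1_to_10'
      WHEN balance > 10 AND balance <= 100 THEN '10_to_100'
      WHEN balance > 100 AND balance <= 1000 THEN '100_to_1000'
      WHEN balance > 1000 AND balance <= 10000 THEN '1000_to_10000'
      WHEN balance > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      tx_date,
      account,
      SUM(net) OVER (PARTITION BY account ORDER BY tx_date) AS balance
    FROM daily_net)
),
moves AS (
  -- Bin the account was in after its previous active day.
  SELECT
    tx_date,
    bin,
    LAG(bin) OVER (PARTITION BY account ORDER BY tx_date) AS prev_bin
  FROM binned
),
changes AS (
  -- +1 for the bin an account enters...
  SELECT tx_date, bin, 1 AS delta
  FROM moves
  WHERE bin IS NOT NULL AND (prev_bin IS NULL OR prev_bin != bin)
  UNION ALL
  -- ...and -1 for the bin it leaves.
  SELECT tx_date, prev_bin AS bin, -1 AS delta
  FROM moves
  WHERE prev_bin IS NOT NULL AND (bin IS NULL OR bin != prev_bin)
),
per_day AS (
  SELECT tx_date, bin, SUM(delta) AS change
  FROM changes
  GROUP BY tx_date, bin
)
SELECT
  tx_date,
  bin,
  -- Running total of moves = number of accounts in the bin at end of day.
  SUM(change) OVER (PARTITION BY bin ORDER BY tx_date) AS users
FROM per_day
ORDER BY tx_date, bin
```

The result is in long format, with one row per date and bin in which the count changed; on days with no row for a bin, the last reported count still applies. It could be pivoted into one column per bin with conditional aggregation. If value is stored as a floating-point type, summed balances may carry small rounding residues near the bin edges.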
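On the performance question, one thing you may want to explore (if possible) is creating a partitioned version of this large table. That would help you 1) improve performance and 2) reduce the cost of queries that only touch a specific date range. You can find more information here.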
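As an illustration of the partitioning idea, a date-partitioned copy could be created with a DDL statement along these lines. This is a minimal sketch: the destination name table.transactions_partitioned is hypothetical, and it assumes the timestamp column really has the TIMESTAMP type.

```sql
#standardSQL
-- Hypothetical destination table; adjust the dataset and table name.
CREATE TABLE `table.transactions_partitioned`
PARTITION BY DATE(timestamp)
AS
SELECT timestamp, sender, value, receiver
FROM `table.transactions`;

-- A filter on the partitioning column lets BigQuery scan only the
-- matching partitions instead of the whole table.
SELECT COUNT(*) AS num_transactions
FROM `table.transactions_partitioned`
WHERE timestamp >= TIMESTAMP '2018-09-09 00:00:00'
  AND timestamp < TIMESTAMP '2018-09-10 00:00:00';
```

Daily partitions over roughly ten years of data come to a few thousand partitions, so it is worth checking the current BigQuery partition quotas first; partitioning by TIMESTAMP_TRUNC(timestamp, MONTH) would be a coarser alternative.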
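EDIT
I added the WHERE clause to the query to filter for the previous day. I am assuming that you will run the query, say, today to get the previous day's data. You will probably need to adjust CURRENT_TIMESTAMP() to your time zone by wrapping it in an additional TIMESTAMP_SUB(...., INTERVAL X HOUR) or TIMESTAMP_ADD(...., INTERVAL X HOUR), where X is the number of hours to subtract or add to match the time zone of the data you are analyzing.
Also, you may need to CAST(timestamp AS TIMESTAMP), depending on the type of your field.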
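To make the time zone adjustment concrete, the query below compares a manual hour shift with BigQuery's optional time zone argument to DATE. Both the offset of 3 hours and the zone name 'America/Sao_Paulo' are placeholders picked for illustration, not values taken from the question.

```sql
#standardSQL
SELECT
  CURRENT_TIMESTAMP() AS now_utc,
  -- Manual shift: X = 3 hours is only an example offset.
  TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR) AS now_shifted,
  -- Alternative: let BigQuery resolve the calendar date in a named zone.
  DATE(CURRENT_TIMESTAMP(), 'America/Sao_Paulo') AS today_local,
  DATE_SUB(DATE(CURRENT_TIMESTAMP(), 'America/Sao_Paulo'), INTERVAL 1 DAY) AS yesterday_local
```

A named zone has the advantage of following daylight saving changes, which a fixed hour offset does not.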