Creating daily bins for blockchain transactions

Asked: 2018-12-07 04:24:05

Tags: google-bigquery

After some processing, I ended up with a table in GBQ that lists every transaction made on the blockchain (~280 million rows):

+-------+-------------------------+--------+-------+----------+
| Linha |           timestamp     | sender | value | receiver |
+-------+-------------------------+--------+-------+----------+
|     1 | 2018-06-28 01:31:00 UTC | User1  | 1.67  | User2    |
|     2 | 2017-04-06 00:47:29 UTC | User3  | 0.02  | User4    |
|     3 | 2013-11-27 13:22:05 UTC | User5  | 0.25  | User6    |
+-------+-------------------------+--------+-------+----------+

Since this table contains all transactions, if I sum up all the values per user up to a given date I get that user's balance, and since I have close to 22 million users, I want to bin them by how many coins they hold. I scanned the whole dataset with the following code:

#standardSQL
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      MAX(timestamp) AS last_seen,
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin

which gave me something like this:

+-------+-------+---------------+
| Linha |  num  |      bin      |
+-------+-------+---------------+
|     1 | 11518 | 1_to_10       |
|     2 |  9503 | 100_to_1000   |
|     3 | 18070 | 10_to_100     |
|     4 | 20275 | 0_to_1        |
|     5 |  1781 | 1000_to_10000 |
|     6 |   158 | More_10000    |
+-------+-------+---------------+

Now, I want to go through every row of the transaction table and check, at the end of each day, how many users fall into each bin. The final table should look like this:

+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
|           timestamp     | 0_to_1  |  1_to_10  | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000 |
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| 2009-01-09 00:00:00 UTC | 1       | 1         | 0         | 0           | 0             | 0          |
| 2009-01-10 00:00:00 UTC | 0       | 2         | 0         | 0           | 0             | 0          |
| ...                     | ...     | ...       | ...       | ...         | ...           | ...        |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315  | 234523555   | 2352355556    | 12124235231|
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+

Since the dataset is too large for me to simply sort it by timestamp to make life easier, I would appreciate any ideas. I'm wondering whether there is some way, e.g. pagination, to improve performance and save resources. I've heard of it but have no idea how to use it.

Thanks!


Update: after some work, I now do have a transaction table sorted by timestamp.
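One way to get the day-end counts in a single pass, instead of re-running the binning query once per cutoff date, is to compute each receiver's running balance with a window function, then bin and pivot the result. A rough sketch (untested, assuming the `table.transactions` schema above; note that a user gets a row only on days they actually receive something, so carrying balances forward across inactive days would still need a join against a calendar of dates):

#standardSQL
WITH per_day AS (
  -- total received by each user on each day
  SELECT
    DATE(timestamp) AS day,
    receiver,
    SUM(value) AS received
  FROM `table.transactions`
  GROUP BY day, receiver
),
daily_balance AS (
  -- running balance per user, up to and including each active day
  SELECT
    day,
    receiver,
    SUM(received) OVER (PARTITION BY receiver ORDER BY day) AS balance
  FROM per_day
)
SELECT
  day,
  COUNTIF(balance > 0     AND balance <= 1)     AS _0_to_1,
  COUNTIF(balance > 1     AND balance <= 10)    AS _1_to_10,
  COUNTIF(balance > 10    AND balance <= 100)   AS _10_to_100,
  COUNTIF(balance > 100   AND balance <= 1000)  AS _100_to_1000,
  COUNTIF(balance > 1000  AND balance <= 10000) AS _1000_to_10000,
  COUNTIF(balance > 10000)                      AS More_10000
FROM daily_balance
GROUP BY day
ORDER BY day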

1 answer:

Answer 0 (score: 1)

The following query should give you the count of users in each bin by timestamp. Keep in mind that this query evaluates transaction values at the row level.

SELECT
  DATE(timestamp) AS day,
  COUNT(DISTINCT CASE
      WHEN value > 0 AND value <= 1 THEN receiver
    END) AS _0_to_1,
  COUNT(DISTINCT CASE
      WHEN value > 1 AND value <= 10 THEN receiver
    END) AS _1_to_10,
  COUNT(DISTINCT CASE
      WHEN value > 10 AND value <= 100 THEN receiver
    END) AS _10_to_100,
  COUNT(DISTINCT CASE
      WHEN value > 100 AND value <= 1000 THEN receiver
    END) AS _100_to_1000,
  COUNT(DISTINCT CASE
      WHEN value > 1000 AND value <= 10000 THEN receiver
    END) AS _1000_to_10000,
  COUNT(DISTINCT CASE
      WHEN value > 10000 THEN receiver
    END) AS More_10000
FROM `table.transactions`
WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY 1

Regarding the performance question, one thing you may want to explore (if possible) is creating a partitioned version of this large table. That will help you 1) improve performance, and 2) reduce the cost of querying data for a specific date range. You can find more information here.
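For instance, a sketch of creating such a partitioned copy with BigQuery DDL (assuming you can write to the dataset; the destination table name is illustrative):

#standardSQL
-- Copy the table, partitioned by day, so that queries filtering on
-- `timestamp` only scan the partitions they actually touch.
CREATE TABLE `table.transactions_partitioned`
PARTITION BY DATE(timestamp) AS
SELECT *
FROM `table.transactions`;

After this, the daily queries above can be pointed at the partitioned table; with a filter on `DATE(timestamp)`, BigQuery prunes all other partitions, which reduces both scan time and billed bytes.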

Edit

I added a WHERE clause to the query to filter for the previous day. I'm assuming you would run the query, say, today to get yesterday's data. Now, you may need to adjust CURRENT_TIMESTAMP() to your time zone by adding a TIMESTAMP_SUB(...., INTERVAL X HOUR) or TIMESTAMP_ADD(...., INTERVAL X HOUR), where X is the number of hours that need to be subtracted or added to match the time zone of the data you are analyzing.
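For example, if the data belongs to a time zone three hours behind UTC, the filter could look like this (a sketch; the 3-hour offset is an assumption for illustration):

-- shift the stored UTC timestamps before truncating to a date
WHERE DATE(TIMESTAMP_SUB(timestamp, INTERVAL 3 HOUR))
      = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)

Alternatively, BigQuery's DATE() function accepts a time zone name directly, e.g. DATE(timestamp, "America/Sao_Paulo"), which avoids hard-coding an hour offset.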

Also, you may need a CAST(timestamp AS TIMESTAMP), depending on the type of the field.