Update rows in SQL every 20 iterations

Time: 2017-09-20 07:20:25

Tags: mysql sql google-bigquery

I have a table of about 1 million rows in Google BigQuery, from the NYC Yellow TaxiCab public dataset. As you can see from that link, the schema has no primary key. Each row represents a trip/transaction, but there is no customer_id field.

I want to add a customer_id column and populate it with values so that:

For rows 1-20, `customer_id` should be assigned `1`
For rows 21-40, `customer_id` should be assigned `2`
and so on…

In other words, I want exactly 20 rows (any 20 rows) in the table to share each particular customer_id value.

2 answers:

Answer 0 (score: 2)

Assign each row a random ID, so that each new_id ends up with a group of roughly 20 rows:

#standardSQL
-- assigns each row a random id in [0, total_rows/20), so each id collects ~20 rows on average
SELECT CAST(FLOOR(COUNT(*) OVER()/20*RAND()) AS INT64) new_id, *
FROM (
  SELECT login
  FROM `ghtorrent-bq.ght_2017_04_01.users`  -- 1M-row dummy sample standing in for the taxi table
  LIMIT 1000000
)

Proof that over one million rows this generates 50,000 "customer_id" values:

[screenshot of query results showing ~50,000 distinct new_id values]
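
If you want to reproduce that check yourself, a minimal sketch along the lines below counts the distinct new_id values over the same million-row dummy sample (this query is not part of the original answer):

#standardSQL
-- sketch only: count distinct new_id values produced by the random assignment above
SELECT COUNT(DISTINCT new_id) AS distinct_ids, COUNT(*) AS total_rows
FROM (
  SELECT CAST(FLOOR(COUNT(*) OVER()/20*RAND()) AS INT64) AS new_id
  FROM (
    SELECT login
    FROM `ghtorrent-bq.ght_2017_04_01.users`
    LIMIT 1000000
  )
)

With 1,000,000 input rows, distinct_ids should land at or just under 50,000; the exact value varies between runs because RAND() is non-deterministic.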

Answer 1 (score: 1)

Below is BigQuery Standard SQL that generates exactly 20 entries per customer_id:

   
#standardSQL
-- number the rows, then integer-divide by 20: rows 1-20 get 0, rows 21-40 get 1, and so on
SELECT DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id, *
FROM `yourTable`
-- ORDER BY customer_id
You can test / play with it using dummy data as shown below:

#standardSQL
WITH `yourTable` AS (
    SELECT login
    FROM `ghtorrent-bq.ght_2017_04_01.users`
    LIMIT 1000000
)
SELECT DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id, *
FROM `yourTable`
-- ORDER BY customer_id  

Also, the query below shows the distribution of row counts per customer_id:

#standardSQL
WITH `yourTable` AS (
    SELECT login
    FROM `ghtorrent-bq.ght_2017_04_01.users`
    LIMIT 1000000
)
SELECT cnt, COUNT(1) AS distribution FROM (
  SELECT customer_id, COUNT(1) AS cnt FROM (
    SELECT *, DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id
    FROM `yourTable`
    ORDER BY customer_id
  )
  GROUP BY customer_id
)
GROUP BY cnt
ORDER BY cnt   

with output as below:

Row cnt distribution     
--- --- ------------
1    20        50000
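
To persist the assignment as an actual column, as the question asks, one option (a sketch only, with hypothetical table names, assuming it is acceptable to materialize a copy of the table) is to write the SELECT into a destination table, for example with CREATE TABLE ... AS SELECT, or by setting a destination table when running the query:

#standardSQL
-- sketch only: `myDataset.taxi_trips` and `myDataset.taxi_with_customer_id` are hypothetical names
-- note: an unpartitioned ROW_NUMBER() over a very large table can be resource-intensive in BigQuery
CREATE TABLE `myDataset.taxi_with_customer_id` AS
SELECT DIV(ROW_NUMBER() OVER() - 1, 20) AS customer_id, *
FROM `myDataset.taxi_trips`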