How can we remove duplicate data from BigQuery and save it to another table, keeping the row with the most attributes

Time: 2019-01-03 13:25:40

Tags: sql google-bigquery

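I have uploaded 99,628 rows to Google BigQuery. The schema has company_name, phone, email, address, city, state, and so on. I want to keep only one row per company_name: the one with the most attributes filled in. If my rows are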

Microsoft | 2355 |

Microsoft | 1234 | ms@example.com | seatle | XYZ | KC

Microsoft | 2355 | any@example.com

I want to keep the second row, because it has the most attributes.

I tried the following query, but it only returns one distinct row per company, not the row with the most attributes.

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
      OVER (PARTITION BY company_name)
      row_number
  FROM `local-bastion-154121.Property_Dataset.pmDATA`
)
WHERE row_number = 1

3 Answers:

Answer 0: (score: 1)

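I interpret "has the most attributes" as the row with the largest number of non-NULL values among the rows for a particular company_name. You should be able to do something like this: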

CREATE TABLE dataset.new_table AS
SELECT
  company_name,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (company_name))
    ORDER BY ARRAY_LENGTH(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':null'))
  )[OFFSET(0)].*
FROM dataset.existing_table AS t
GROUP BY company_name

Using the sample data as an example:

WITH existing_table AS (
  SELECT 'Microsoft' AS company_name, 2355 AS x, NULL AS email, NULL AS city, NULL AS y, NULL AS z UNION ALL
  SELECT 'Microsoft', 1234, 'ms@example.com', 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, NULL, NULL, NULL, NULL
)
SELECT
  company_name,
  ARRAY_AGG(
    (SELECT AS STRUCT t.* EXCEPT (company_name))
    ORDER BY ARRAY_LENGTH(SPLIT(TO_JSON_STRING(t), ':null'))
  )[OFFSET(0)].*
FROM existing_table AS t
GROUP BY company_name

The nice thing about this trick, using TO_JSON_STRING in combination with SPLIT to count the NULL values, is that you don't need to write out the list of other columns explicitly. It builds a struct of all columns other than company_name and orders by the number of NULL values in the row, ascending, which means that you get the row with the most populated values for each value of company_name.
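To see what the ORDER BY expression is actually sorting on, the following sketch prints each row's JSON form next to its NULL count. It assumes a cut-down version of the sample table; row_json and null_count are illustrative names that do not appear in the original query. It relies on TO_JSON_STRING emitting compact JSON with no space after the colon, and a string value that itself contains ':null' would be miscounted.

```sql
#standardSQL
-- Hypothetical two-row sample; the explicit CASTs just make the column types obvious.
WITH existing_table AS (
  SELECT 'Microsoft' AS company_name, 2355 AS x,
         CAST(NULL AS STRING) AS email, CAST(NULL AS STRING) AS city UNION ALL
  SELECT 'Microsoft', 1234, 'ms@example.com', 'seattle'
)
SELECT
  -- Each NULL column is serialized as "column":null
  TO_JSON_STRING(t) AS row_json,
  -- SPLIT returns one more piece than the number of ':null' separators
  ARRAY_LENGTH(SPLIT(TO_JSON_STRING(t), ':null')) - 1 AS null_count
FROM existing_table AS t
ORDER BY null_count
```

The original query skips the "- 1" because a constant offset does not change the sort order.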

Answer 1: (score: 1)

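I would consider translating "has the most attributes" by introducing a weight for each field. For example, I would expect having email to be more important than having city or state, so one field, to me, can outweigh two fields.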

The following is for BigQuery Standard SQL and tries out the weighted approach

#standardSQL
WITH weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT
  ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    ANY_VALUE(t) r,
    SUM(weight) score
  FROM `local-bastion-154121.Property_Dataset.pmDATA` t
  CROSS JOIN weights w 
  WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'', field, '":"?(.*?)"?[,}]')) != 'null'
  GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name    

You can test and play with this using the sample data from the question, as below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
  SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT
  ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT 
    ANY_VALUE(t) r,
    SUM(weight) score
  FROM `project.dataset.table` t
  CROSS JOIN weights w 
  WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'', field, '":"?(.*?)"?[,}]')) != 'null'
  GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name   

with the result

Row company_name    phone   email           city    address state    
1   Microsoft       2355    any@example.com null    null    null      

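As you can see here, the winner has fewer populated attributes than the row with city, address, and state, but it wins because its attributes are more "valuable"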

You can see the scores using the query below

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
  SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
  SELECT 'phone' field, 4 weight UNION ALL
  SELECT 'email', 100 UNION ALL
  SELECT 'city', 2 UNION ALL
  SELECT 'address', 1 UNION ALL
  SELECT 'state', 7
)
SELECT 
  ANY_VALUE(t).*,
  SUM(weight) score
FROM `project.dataset.table` t
CROSS JOIN weights w 
WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'', field, '":"?(.*?)"?[,}]')) != 'null'
GROUP BY TO_JSON_STRING(t)
ORDER BY score DESC

So the scores are

Row company_name    phone   email           city    address state   score   
1   Microsoft       2355    any@example.com null    null    null    104  
2   Microsoft       1234    null            seattle XYZ     KC      14   
3   Microsoft       2355    null            null    null    null    4    
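When the list of weighted fields is short, the same scoring idea can be sketched without the JSON regex by adding the weights directly. This is only an illustration under the assumption that the table has exactly the five weighted columns used above; score and rn are made-up names. Unlike the CROSS JOIN version, a row whose weighted fields are all NULL still receives a score of 0 instead of being filtered out, and exact duplicate rows do not have their weights summed together.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'Microsoft' company_name, 2355 phone, CAST(NULL AS STRING) email,
         CAST(NULL AS STRING) city, CAST(NULL AS STRING) address,
         CAST(NULL AS STRING) state UNION ALL
  SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
  SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
)
SELECT * EXCEPT (score, rn)
FROM (
  SELECT *,
    -- Rank the rows of each company from highest to lowest score
    ROW_NUMBER() OVER (PARTITION BY company_name ORDER BY score DESC) AS rn
  FROM (
    SELECT *,
      -- Same weights as the weights CTE above, applied per populated column
      IF(phone IS NOT NULL, 4, 0)
      + IF(email IS NOT NULL, 100, 0)
      + IF(city IS NOT NULL, 2, 0)
      + IF(address IS NOT NULL, 1, 0)
      + IF(state IS NOT NULL, 7, 0) AS score
    FROM `project.dataset.table`
  )
)
WHERE rn = 1
```

With the sample rows this picks the any@example.com row (score 104), the same winner as above. The trade-off is that every weighted column has to be listed by hand.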

Answer 2: (score: 0)

You can create a subquery that counts the number of filled columns for each row, and then order by that count:

SELECT *
FROM (
  SELECT
      *,
      ROW_NUMBER()
          OVER (PARTITION BY company_name ORDER BY columns_filled DESC)
          row_number
  FROM (
        SELECT *, 
        IF(uppose !="", 1,0) + IF(company_name !="", 1,0) + IF(phone !="", 1,0) + 
        IF(email !="", 1,0) + IF(address !="", 1,0) + IF(city !="", 1,0) + 
        IF(state !="", 1,0) + <SAME FOR EACH FIELD> as columns_filled
        FROM `local-bastion-154121.Property_Dataset.pmDATA`
   )
)
WHERE row_number = 1

That's it :)
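A comparison such as phone != "" only type-checks when the column is a STRING, so a numeric phone column would need a cast. The sketch below is a hedged variant of the query above that treats both NULL and the empty string as "not filled". It assumes a hypothetical four-column sample table named sample, and rn is an illustrative alias; extend the sum with one term per real column.

```sql
#standardSQL
-- Hypothetical sample: phone is INT64, the other columns are STRING.
WITH sample AS (
  SELECT 'Microsoft' AS company_name, 2355 AS phone,
         CAST(NULL AS STRING) AS email, '' AS city UNION ALL
  SELECT 'Microsoft', 1234, 'ms@example.com', 'seattle'
)
SELECT * EXCEPT (columns_filled, rn)
FROM (
  SELECT *,
    -- Most-filled row of each company gets rn = 1
    ROW_NUMBER() OVER (PARTITION BY company_name ORDER BY columns_filled DESC) AS rn
  FROM (
    SELECT *,
      -- Cast non-string columns first; NULL and '' both count as 0
      IF(IFNULL(CAST(phone AS STRING), '') != '', 1, 0)
      + IF(IFNULL(email, '') != '', 1, 0)
      + IF(IFNULL(city, '') != '', 1, 0) AS columns_filled
    FROM sample
  )
)
WHERE rn = 1
```

Ties between equally filled rows are broken arbitrarily; adding a second ORDER BY key would make the choice deterministic.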