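I have uploaded 99,628 rows to Google BigQuery.
The schema has company_name, phone, email, address, city, state, and so on.
I want to deduplicate by company_name, keeping only the row with the most attributes for each company.
If my rows are
Microsoft | 2355 |
Microsoft | 1234 | ms@example.com | seattle | XYZ | KC
Microsoft | 2355 | any@example.com
I want to keep the second row because it has the most attributes.
I tried the following query, but it only returns one distinct row per company, not the row with the most attributes.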
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY company_name)
row_number
FROM `local-bastion-154121.Property_Dataset.pmDATA`
)
WHERE row_number = 1
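Answer 0 (score: 1)
I am interpreting "has the most attributes" as the row with the greatest number of non-NULL values among the rows for a particular company_name. You should be able to do something like this: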
CREATE TABLE dataset.new_table AS
SELECT
company_name,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (company_name))
ORDER BY ARRAY_LENGTH(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':null'))
)[OFFSET(0)].*
FROM dataset.existing_table AS t
GROUP BY company_name
As an example with sample data:
WITH existing_table AS (
SELECT 'Microsoft' AS company_name, 2355 AS x, NULL AS email, NULL AS city, NULL AS y, NULL AS z UNION ALL
SELECT 'Microsoft', 1234, 'ms@example.com', 'seattle', 'XYZ', 'KC' UNION ALL
SELECT 'Microsoft', 2355, NULL, NULL, NULL, NULL
)
SELECT
company_name,
ARRAY_AGG(
(SELECT AS STRUCT t.* EXCEPT (company_name))
ORDER BY ARRAY_LENGTH(SPLIT(TO_JSON_STRING(t), ':null'))
)[OFFSET(0)].*
FROM existing_table AS t
GROUP BY company_name
The nice part about using this trick with TO_JSON_STRING, combined with REGEXP_EXTRACT_ALL or SPLIT to count NULL values, is that you don't need to write out the list of other columns explicitly. What it does is build a struct of all the columns other than company_name and sort by the number of NULL values in the row in ascending order, which means that you get the most populated row for each value of company_name.
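To make the ordering concrete, here is a minimal Python sketch of the same idea outside BigQuery: serialize each row to compact JSON, count the `:null` occurrences, and keep the row with the fewest per company_name. The names `count_nulls` and `most_populated_rows` and the dict-based rows are assumptions made for illustration, and `json.dumps` with compact separators only approximates what TO_JSON_STRING emits.

```python
import json

def count_nulls(row):
    """Count ':null' occurrences in the compact JSON form of a row."""
    text = json.dumps(row, separators=(",", ":"))
    return text.count(":null")

def most_populated_rows(rows, key="company_name"):
    """Keep, for each key value, the row with the fewest NULL fields."""
    best = {}
    for row in rows:
        name = row[key]
        # Strict '<' keeps the first row seen when two rows tie.
        if name not in best or count_nulls(row) < count_nulls(best[name]):
            best[name] = row
    return list(best.values())

rows = [
    {"company_name": "Microsoft", "x": 2355, "email": None, "city": None, "y": None, "z": None},
    {"company_name": "Microsoft", "x": 1234, "email": "ms@example.com", "city": "seattle", "y": "XYZ", "z": "KC"},
    {"company_name": "Microsoft", "x": 2355, "email": None, "city": None, "y": None, "z": None},
]
print(most_populated_rows(rows))
```

As with the SQL version, a string value that happens to contain `:null` would be miscounted, so this is a convenience trick rather than a strict NULL check.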
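Answer 1 (score: 1)
I would interpret "has the most attributes" by introducing a weight for each field. For example, I might consider having email more important than having city and state, so one field can outweigh two fields.
The following is for BigQuery Standard SQL and tries out the weighted approach: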
#standardSQL
WITH weights AS (
SELECT 'phone' field, 4 weight UNION ALL
SELECT 'email', 100 UNION ALL
SELECT 'city', 2 UNION ALL
SELECT 'address', 1 UNION ALL
SELECT 'state', 7
)
SELECT
ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT
ANY_VALUE(t) r,
SUM(weight) score
FROM `local-bastion-154121.Property_Dataset.pmDATA` t
CROSS JOIN weights w
WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name
You can test and play with this using sample data based on the data in your question:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
SELECT 'phone' field, 4 weight UNION ALL
SELECT 'email', 100 UNION ALL
SELECT 'city', 2 UNION ALL
SELECT 'address', 1 UNION ALL
SELECT 'state', 7
)
SELECT
ARRAY_AGG(r ORDER BY score DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT
ANY_VALUE(t) r,
SUM(weight) score
FROM `project.dataset.table` t
CROSS JOIN weights w
WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
GROUP BY TO_JSON_STRING(t)
)
GROUP BY r.company_name
The result is:
Row company_name phone email city address state
1 Microsoft 2355 any@example.com null null null
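As you can see here, the winner has fewer available attributes than another row, but it wins because its attributes are more "valuable".
You can see the scores using the query below:
#standardSQL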
WITH `project.dataset.table` AS (
SELECT 'Microsoft' company_name, 2355 phone, NULL email, NULL city, NULL address, NULL state UNION ALL
SELECT 'Microsoft', 1234, NULL, 'seattle', 'XYZ', 'KC' UNION ALL
SELECT 'Microsoft', 2355, 'any@example.com', NULL, NULL, NULL
), weights AS (
SELECT 'phone' field, 4 weight UNION ALL
SELECT 'email', 100 UNION ALL
SELECT 'city', 2 UNION ALL
SELECT 'address', 1 UNION ALL
SELECT 'state', 7
)
SELECT
ANY_VALUE(t).*,
SUM(weight) score
FROM `project.dataset.table` t
CROSS JOIN weights w
WHERE REGEXP_EXTRACT(TO_JSON_STRING(t), CONCAT(r'"', field, '":"?(.*?)"?[,}]')) != 'null'
GROUP BY TO_JSON_STRING(t)
ORDER BY score DESC
So the scores are:
Row company_name phone email city address state score
1 Microsoft 2355 any@example.com null null null 104
2 Microsoft 1234 null seattle XYZ KC 14
3 Microsoft 2355 null null null null 4
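The arithmetic behind those scores can be checked with a small Python sketch. It assumes rows are plain dicts and reuses the weights from the `weights` CTE; the helper names `score` and `best_by_score` are made up for illustration and are not part of the query.

```python
# Weights copied from the `weights` CTE in the query.
WEIGHTS = {"phone": 4, "email": 100, "city": 2, "address": 1, "state": 7}

def score(row, weights=WEIGHTS):
    """Sum the weights of the fields that are populated (not None)."""
    return sum(w for field, w in weights.items() if row.get(field) is not None)

def best_by_score(rows, key="company_name"):
    """Keep, for each key value, the row with the highest weighted score."""
    best = {}
    for row in rows:
        name = row[key]
        if name not in best or score(row) > score(best[name]):
            best[name] = row
    return list(best.values())

rows = [
    {"company_name": "Microsoft", "phone": 2355, "email": None, "city": None, "address": None, "state": None},
    {"company_name": "Microsoft", "phone": 1234, "email": None, "city": "seattle", "address": "XYZ", "state": "KC"},
    {"company_name": "Microsoft", "phone": 2355, "email": "any@example.com", "city": None, "address": None, "state": None},
]
for row in rows:
    print(row["phone"], row["email"], score(row))
print(best_by_score(rows))
```

One difference to keep in mind: a row with none of the weighted fields populated scores 0 in this sketch, whereas the WHERE clause in the query filters such a row out entirely.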
Answer 2 (score: 0)
You can create a subquery that counts the number of populated columns for each row, and then order by that count:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY company_name ORDER BY columns_filled DESC)
row_number
FROM (
SELECT *,
IF(company_name !="", 1,0) + IF(phone !="", 1,0) +
IF(email !="", 1,0) + IF(address !="", 1,0) + IF(city !="", 1,0) +
IF(state !="", 1,0) + <SAME FOR EACH FIELD> as columns_filled
FROM `local-bastion-154121.Property_Dataset.pmDATA`
)
)
WHERE row_number = 1
That's it :)
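For comparison, the same "count the filled columns, then take row_number = 1" logic can be sketched in Python. The explicit `COLUMNS` list stands in for the per-column IF expressions, and the function names are hypothetical; values are kept as strings here to mirror the `!= ""` comparisons.

```python
from itertools import groupby

# Explicit column list, mirroring the one IF(...) term per column in the query.
COLUMNS = ["company_name", "phone", "email", "address", "city", "state"]

def columns_filled(row):
    """Count columns that are neither missing/None nor an empty string."""
    return sum(1 for col in COLUMNS if row.get(col) not in (None, ""))

def top_row_per_company(rows):
    # Sort by company, then by filled-column count descending,
    # like PARTITION BY company_name ORDER BY columns_filled DESC.
    ordered = sorted(rows, key=lambda r: (r["company_name"], -columns_filled(r)))
    # The first row of each group plays the role of row_number = 1.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r["company_name"])]

rows = [
    {"company_name": "Microsoft", "phone": "2355"},
    {"company_name": "Microsoft", "phone": "1234", "email": "ms@example.com",
     "address": "XYZ", "city": "seattle", "state": "KC"},
    {"company_name": "Microsoft", "phone": "2355", "email": "any@example.com"},
]
print(top_row_per_company(rows))
```

The trade-off against the JSON-based trick is that every column has to be listed by hand, but the count does not depend on how the row is serialized.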