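I realize there are a million ways to get the schema of a dataset.table in Google BigQuery....
Is there a way to get the schema data with a SELECT statement, similar to querying the INFORMATION_SCHEMA tables in SQL Server?
Thanks.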
Answer 0: (score: 6)
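I need to perform data profiling, and the only tool I have is the QUERY function in the web UI. I want to create a query that counts nulls, non-nulls, string lengths, etc. for each column
Below is a possible direction/idea for you to explore and adapt to your needs.
It works relatively well for simple schemas; it looks like it needs adjusting for schemas with records and repeated fields.
Also note that it skips columns that are NULL in all rows of the table, so the approach below will not show those columns.
So, with fh-bigquery.reddit.subreddits as a simple test table: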
#standardSQL
WITH `table` AS (
  SELECT * FROM `fh-bigquery.reddit.subreddits`
),
-- Serialize each row to JSON and strip the outer braces
table_as_json AS (
  SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$', '') AS row
  FROM `table` AS t
),
-- Split each row on ',"' into name:value fragments, then on ':' into
-- column name and value; a JSON null becomes a SQL NULL
pairs AS (
  SELECT
    REPLACE(column_name, '"', '') AS column_name,
    IF(SAFE_CAST(column_value AS STRING) = 'null', NULL, column_value) AS column_value
  FROM table_as_json, UNNEST(SPLIT(row, ',"')) AS z,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(0)]]) AS column_name,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(1)]]) AS column_value
)
-- One output row of metrics per column
SELECT
  column_name,
  COUNT(DISTINCT column_value) AS _distinct_values,
  COUNTIF(column_value IS NULL) AS _nulls,
  COUNTIF(column_value IS NOT NULL) AS _non_nulls,
  MIN(LENGTH(SAFE_CAST(column_value AS STRING))) AS _min_length,
  MAX(LENGTH(SAFE_CAST(column_value AS STRING))) AS _max_length,
  ROUND(AVG(LENGTH(SAFE_CAST(column_value AS STRING)))) AS _avr_length
FROM pairs
WHERE column_name <> ''
GROUP BY column_name
ORDER BY column_name
The result is (the _distinct_values column is not shown):
column_name   _nulls  _non_nulls  _min_length  _max_length  _avr_length
------------  ------  ----------  -----------  -----------  -----------
c_posts            0        2499            1            4          4.0
created_utc        0        2499           14           14         14.0
downs              0        2499            1            8          5.0
num_comments       0        2499            1            7          5.0
score              0        2499            1            7          5.0
subr               0        2499            4           23         12.0
ups                0        2499            1            8          5.0
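To sanity-check one row of that output, the same metrics can be computed directly for a single column. This is only a sketch, assuming the subr column of the same table. Because the generic query measures the JSON text of each value, its string lengths include the surrounding double quotes, so they can differ from the lengths returned here.

```sql
#standardSQL
-- Direct profile of one known column, for comparison with the generic query
SELECT
  COUNT(DISTINCT subr) AS _distinct_values,
  COUNTIF(subr IS NULL) AS _nulls,
  COUNTIF(subr IS NOT NULL) AS _non_nulls,
  MIN(LENGTH(subr)) AS _min_length,
  MAX(LENGTH(subr)) AS _max_length,
  ROUND(AVG(LENGTH(subr))) AS _avr_length
FROM `fh-bigquery.reddit.subreddits`
```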
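I think this is pretty close to what is called profiling (within the limits of what is available). You can easily add any other column metrics, etc.
I really think this can be a good starting point for you.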
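As an illustration of adding a metric, here is a sketch of the same approach run against a small inline table, so it can be tried without scanning a real one. The sample rows and the _null_ratio metric name are made up for this example and are not part of the original query.

```sql
#standardSQL
WITH `table` AS (
  -- Hypothetical sample rows; replace with SELECT * FROM your own table
  SELECT * FROM UNNEST([
    STRUCT('pics' AS subr, 10 AS score),
    STRUCT('aww' AS subr, CAST(NULL AS INT64) AS score),
    STRUCT(CAST(NULL AS STRING) AS subr, 7 AS score)
  ])
),
table_as_json AS (
  SELECT REGEXP_REPLACE(TO_JSON_STRING(t), r'^{|}$', '') AS row
  FROM `table` AS t
),
pairs AS (
  SELECT
    REPLACE(column_name, '"', '') AS column_name,
    IF(SAFE_CAST(column_value AS STRING) = 'null', NULL, column_value) AS column_value
  FROM table_as_json, UNNEST(SPLIT(row, ',"')) AS z,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(0)]]) AS column_name,
    UNNEST([SPLIT(z, ':')[SAFE_OFFSET(1)]]) AS column_value
)
SELECT
  column_name,
  COUNTIF(column_value IS NULL) AS _nulls,
  COUNTIF(column_value IS NOT NULL) AS _non_nulls,
  -- Added metric (hypothetical name): share of NULL values per column
  ROUND(COUNTIF(column_value IS NULL) / COUNT(*), 3) AS _null_ratio
FROM pairs
WHERE column_name <> ''
GROUP BY column_name
ORDER BY column_name
```

Any other aggregate over column_value can be added to the final SELECT in the same way.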
Answer 1: (score: 1)
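If the goal is to compute information such as the number of null and non-null values per column, then Mikhail's answer still makes sense. To answer the original question, though, BigQuery supports INFORMATION_SCHEMA views, which are in beta at the time of writing. If you want to get the schema of a table, you can query the COLUMNS view, for example: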
SELECT column_name, data_type
FROM `fh-bigquery`.reddit.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'subreddits'
ORDER BY ordinal_position
This returns:
Row  column_name   data_type
1    subr          STRING
2    created_utc   TIMESTAMP
3    score         INT64
4    num_comments  INT64
5    c_posts       INT64
6    ups           INT64
7    downs         INT64
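The same view can also list the schema of every table in the dataset if the table_name filter is dropped. A minimal sketch, assuming the COLUMNS view exposes is_nullable alongside the columns used above:

```sql
-- Schema of all tables in the reddit dataset, one row per column
SELECT table_name, column_name, data_type, is_nullable
FROM `fh-bigquery`.reddit.INFORMATION_SCHEMA.COLUMNS
ORDER BY table_name, ordinal_position
```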