BigQuery - How to create columns in a new table from row values

Time: 2017-06-15 00:54:54

Tags: sql google-bigquery google-cloud-sql

I have the following genomic table in BigQuery (more than 12K rows). A long list of PIK3CA_features (column 2) is associated with the same sample_id (column 1):

Row sample_id   PIK3CA_features  
1   hu011C57    chr3_3930069__TGT    
2   hu011C57    chr3_3929921_TC  
3   hu011C57    chr3_3929739_TC  
4   hu011C57    chr3_3929813__T  
5   hu011C57    chr3_3929897_GA  
6   hu011C57    chr3_3929977_TC  
7   hu011C57    chr3_3929783_TC  

I would like to generate the following table:

Row sample_id   chr3_3930069__TGT   chr3_3929921_TC chr3_3929739_TC
1   hu011C57    1                   1               0
2   hu011C58    0    

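That is, one row per sample ID, with a 1 or 0 in each column indicating whether that PIK3CA_feature is present in the sample.

Any idea how to generate this table easily?

Any ideas are much appreciated!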

2 answers:

Answer 0: (score: 1)

The only idea that comes to mind is to use ARRAYS and STRUCTS to get close to what you need, as follows:

WITH data AS(
SELECT 'hu011C57' sample_id, 'chr3_3930069__TGT' PIK3CA_features union all
SELECT 'hu011C57', 'chr3_3929921_TC' union all
SELECT 'hu011C57', 'chr3_3929739_TC' union all
SELECT 'hu011C57', 'chr3_3929813__T' union all
SELECT 'hu011C57', 'chr3_3929897_GA' union all  
SELECT 'hu011C57', 'chr3_3929977_TC' union all
SELECT 'hu011C57', 'chr3_3929783_TC' union all
SELECT 'hu011C58', 'chr3_3929783_TC' union all
SELECT 'hu011C58', 'chr3_3929921_TC'
),

all_features AS (
  SELECT DISTINCT PIK3CA_features FROM data
),

aggregated_samples AS(
  SELECT
    sample_id,
    ARRAY_AGG(DISTINCT PIK3CA_features) features
FROM data
GROUP BY sample_id
)

SELECT
  sample_id,
  ARRAY(
    SELECT AS STRUCT
      PIK3CA_features,
      -- TRUE when this sample's feature array contains the feature
      PIK3CA_features IN (SELECT feature FROM UNNEST(features) feature) AS present
    FROM all_features
    ORDER BY PIK3CA_features
  ) features
FROM aggregated_samples

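This returns one row per sample_id, with an array of structs that pairs each feature with a flag saying whether it is present in that sample_id.

Because BigQuery supports this kind of data structure natively, you can represent your data this way without losing any advanced analytical capability, such as analytic functions, subqueries, and so on.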
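As a rough sketch of how such a nested result can still be filtered, the query below looks for samples in which one particular feature is present. It is written for BigQuery Standard SQL. The `sample_features` CTE is a hypothetical inline stand-in for the output of the query above (in practice it might be a saved table or view), and its rows are illustrative only.

```sql
-- Hypothetical stand-in for the nested result: one row per sample_id,
-- each with an array of (PIK3CA_features, present) structs.
WITH sample_features AS (
  SELECT
    'hu011C57' AS sample_id,
    [STRUCT('chr3_3929783_TC' AS PIK3CA_features, TRUE AS present),
     STRUCT('chr3_3929921_TC' AS PIK3CA_features, TRUE AS present),
     STRUCT('chr3_3930069__TGT' AS PIK3CA_features, TRUE AS present)] AS features
  UNION ALL
  SELECT
    'hu011C58',
    [STRUCT('chr3_3929783_TC' AS PIK3CA_features, TRUE AS present),
     STRUCT('chr3_3929921_TC' AS PIK3CA_features, TRUE AS present),
     STRUCT('chr3_3930069__TGT' AS PIK3CA_features, FALSE AS present)]
)

-- Keep only the samples whose array marks the chosen feature as present.
SELECT sample_id
FROM sample_features
WHERE EXISTS (
  SELECT 1
  FROM UNNEST(features) AS f
  WHERE f.PIK3CA_features = 'chr3_3930069__TGT'
    AND f.present
);
```

The same `UNNEST(features)` pattern can be used in the FROM clause to flatten the array back into one row per sample and feature if a long format is more convenient.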

Answer 1: (score: 0)

You can achieve this by grouping by sample ID.

SELECT 
    sample_id,
    COUNTIF(PIK3CA_features = 'chr3_3930069__TGT') as chr3_3930069__TGT,
    COUNTIF(PIK3CA_features = 'chr3_3929921_TC') as chr3_3929921_TC,
    COUNTIF(PIK3CA_features = 'chr3_3929739_TC') as chr3_3929739_TC
FROM `your_table`  -- placeholder name; COUNTIF needs Standard SQL, which quotes table names with backticks
GROUP BY sample_id;

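Assuming there are no duplicate PIK3CA_features per sample ID, this should give you what you need. With duplicates, COUNTIF would return the number of occurrences rather than 1/0.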
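If duplicates can occur, one possible variation (a sketch, not part of the original answer) is to replace each COUNTIF with MAX(IF(..., 1, 0)), which stays at 1 however many times a feature repeats. The inline `data` CTE below is made-up sample data with one deliberate duplicate; swap in your own table. Written for BigQuery Standard SQL.

```sql
-- Illustrative data only; the third row duplicates the second on purpose.
WITH data AS (
  SELECT 'hu011C57' AS sample_id, 'chr3_3930069__TGT' AS PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929921_TC'
)

SELECT
  sample_id,
  -- MAX over 1/0 flags yields 1 if the feature occurs at least once, else 0.
  MAX(IF(PIK3CA_features = 'chr3_3930069__TGT', 1, 0)) AS chr3_3930069__TGT,
  MAX(IF(PIK3CA_features = 'chr3_3929921_TC', 1, 0)) AS chr3_3929921_TC,
  MAX(IF(PIK3CA_features = 'chr3_3929739_TC', 1, 0)) AS chr3_3929739_TC
FROM data
GROUP BY sample_id
ORDER BY sample_id;
```

As in the original answer, each output column still has to be listed by hand, so this approach suits cases where the set of features of interest is known up front.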