Question

I have the following genomic table in BigQuery (more than 12K rows). A long list of PIK3CA_features values (column 2) is associated with the same sample_id (column 1).

Row sample_id   PIK3CA_features  
1   hu011C57    chr3_3930069__TGT    
2   hu011C57    chr3_3929921_TC  
3   hu011C57    chr3_3929739_TC  
4   hu011C57    chr3_3929813__T  
5   hu011C57    chr3_3929897_GA  
6   hu011C57    chr3_3929977_TC  
7   hu011C57    chr3_3929783_TC

I would like to generate the following table:

Row sample_id   chr3_3930069__TGT   chr3_3929921_TC chr3_3929739_TC
1   hu011C57    1                   1               0
2   hu011C58    0

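That is, one row per sample ID, with 1/0 indicating whether each PIK3CA_features value is present in that sample.

Any idea how to generate this table easily?

Any ideas are much appreciated!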

Answer 1

The only idea that comes to mind is to use ARRAYS and STRUCTS to get close to what you need, like so:

WITH data AS(
SELECT 'hu011C57' sample_id, 'chr3_3930069__TGT' PIK3CA_features union all
SELECT 'hu011C57', 'chr3_3929921_TC' union all
SELECT 'hu011C57', 'chr3_3929739_TC' union all
SELECT 'hu011C57', 'chr3_3929813__T' union all
SELECT 'hu011C57', 'chr3_3929897_GA' union all  
SELECT 'hu011C57', 'chr3_3929977_TC' union all
SELECT 'hu011C57', 'chr3_3929783_TC' union all
SELECT 'hu011C58', 'chr3_3929783_TC' union all
SELECT 'hu011C58', 'chr3_3929921_TC'
),

all_features AS (
  SELECT DISTINCT PIK3CA_features FROM data
),

aggregated_samples AS(
  SELECT
    sample_id,
    ARRAY_AGG(DISTINCT PIK3CA_features) features
FROM data
GROUP BY sample_id
)

SELECT
  sample_id,
  ARRAY(
    SELECT AS STRUCT
      PIK3CA_features,
      -- TRUE when this feature occurs in the sample's feature array
      PIK3CA_features IN (SELECT feature FROM UNNEST(features) feature) AS present
    FROM all_features
    ORDER BY PIK3CA_features
  ) features
FROM aggregated_samples

This returns one row per sample_id, with an array of structs that lists every feature and whether it is present in that sample_id.

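Because BigQuery supports this kind of data structure natively, you can represent your data this way without losing any advanced analysis capabilities, such as analytic functions, subqueries, and so on.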
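As a rough illustration of querying that nested result, here is a minimal sketch that counts how many features are present per sample with a scalar subquery over the array. It uses a reduced copy of the sample data, and the names feature, present, result and n_present are assumptions made for this example, not part of the original query.

```sql
-- Sketch only: reduced sample data; feature, present, result and n_present
-- are hypothetical names chosen for this illustration.
WITH data AS (
  SELECT 'hu011C57' AS sample_id, 'chr3_3930069__TGT' AS PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929921_TC'
),
all_features AS (
  SELECT DISTINCT PIK3CA_features FROM data
),
aggregated_samples AS (
  SELECT sample_id, ARRAY_AGG(DISTINCT PIK3CA_features) AS features
  FROM data
  GROUP BY sample_id
),
result AS (
  SELECT
    sample_id,
    ARRAY(
      SELECT AS STRUCT
        PIK3CA_features AS feature,
        PIK3CA_features IN UNNEST(features) AS present
      FROM all_features
      ORDER BY PIK3CA_features
    ) AS features
  FROM aggregated_samples
)
SELECT
  sample_id,
  -- Scalar subquery over the array: count the features flagged as present
  (SELECT COUNTIF(f.present) FROM UNNEST(features) AS f) AS n_present
FROM result
ORDER BY sample_id;
```

With this reduced data, hu011C57 should come back with n_present = 2 and hu011C58 with n_present = 1.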

Answer 2

You can do this by grouping on the sample ID.

SELECT 
    sample_id,
    COUNTIF(PIK3CA_features = 'chr3_3930069__TGT') as chr3_3930069__TGT,
    COUNTIF(PIK3CA_features = 'chr3_3929921_TC') as chr3_3929921_TC,
    COUNTIF(PIK3CA_features = 'chr3_3929739_TC') as chr3_3929739_TC
FROM `your_table`  -- placeholder: replace with your own table name
GROUP BY sample_id;

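Assuming there are no duplicate PIK3CA_features values per sample ID, this should give you what you need. (If duplicates do exist, COUNTIF returns the number of occurrences, which can be greater than 1, rather than a strict 1/0 flag.)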
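If duplicates cannot be ruled out, one possible variation is to take the MAX of a 0/1 flag instead of counting. The sketch below is only an illustration: it uses a small inline copy of the sample data with one duplicate row added on purpose, and it still hard-codes the three feature names from the question.

```sql
-- Sketch only: `data` is a reduced inline copy of the sample table,
-- with one duplicate row added deliberately to show the effect.
WITH data AS (
  SELECT 'hu011C57' AS sample_id, 'chr3_3930069__TGT' AS PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3930069__TGT' UNION ALL  -- duplicate on purpose
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929921_TC'
)
SELECT
  sample_id,
  -- MAX over 0/1 flags stays at 1 however many times the feature repeats
  MAX(IF(PIK3CA_features = 'chr3_3930069__TGT', 1, 0)) AS chr3_3930069__TGT,
  MAX(IF(PIK3CA_features = 'chr3_3929921_TC', 1, 0)) AS chr3_3929921_TC,
  MAX(IF(PIK3CA_features = 'chr3_3929739_TC', 1, 0)) AS chr3_3929739_TC
FROM data
GROUP BY sample_id
ORDER BY sample_id;
```

On this data the first column is 1 for hu011C57 despite the duplicate row, whereas COUNTIF would report 2 there.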

BigQuery - how to create columns for a new table from row values
