I have the following genomic table in BigQuery (more than 12K rows). A long list of PIK3CA_features (column 2) is associated with the same sample_id (column 1):
Row sample_id PIK3CA_features
1 hu011C57 chr3_3930069__TGT
2 hu011C57 chr3_3929921_TC
3 hu011C57 chr3_3929739_TC
4 hu011C57 chr3_3929813__T
5 hu011C57 chr3_3929897_GA
6 hu011C57 chr3_3929977_TC
7 hu011C57 chr3_3929783_TC
I would like to generate the following table:
Row sample_id chr3_3930069__TGT chr3_3929921_TC chr3_3929739_TC
1 hu011C57 1 1 0
2 hu011C58 0
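That is, one row per sample ID, with 1/0 in each column depending on whether that PIK3CA_feature is present in the sample.
Any idea how to generate this table easily?
Thanks a lot for any ideas!
Answer 0 (score: 1)
The only idea that comes to mind is to use ARRAYS and STRUCTS to get close to what you need, as follows: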
WITH data AS (
  -- Inline sample data standing in for the real table
  SELECT 'hu011C57' sample_id, 'chr3_3930069__TGT' PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C57', 'chr3_3929739_TC' UNION ALL
  SELECT 'hu011C57', 'chr3_3929813__T' UNION ALL
  SELECT 'hu011C57', 'chr3_3929897_GA' UNION ALL
  SELECT 'hu011C57', 'chr3_3929977_TC' UNION ALL
  SELECT 'hu011C57', 'chr3_3929783_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929783_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929921_TC'
),
all_features AS (
  -- Every distinct feature seen in any sample
  SELECT DISTINCT PIK3CA_features FROM data
),
aggregated_samples AS (
  -- One row per sample with the array of features it contains
  SELECT
    sample_id,
    ARRAY_AGG(DISTINCT PIK3CA_features) features
  FROM data
  GROUP BY sample_id
)
SELECT
  sample_id,
  ARRAY(
    SELECT AS STRUCT
      PIK3CA_features,
      -- TRUE if this sample's feature array contains the feature
      PIK3CA_features IN (SELECT feature FROM UNNEST(features) feature) AS present
    FROM all_features
    ORDER BY PIK3CA_features
  ) features
FROM aggregated_samples
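This returns one row per sample_id, together with an array of structs that lists every feature and whether it is present in that sample_id.
Because BigQuery natively supports this kind of data structure, you can keep your data in this representation without losing any advanced analytical capability, such as analytic functions, subqueries, and so on.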
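To make the shape of that result concrete, here is a rough plain-Python sketch that mirrors the query's logic on the same sample rows. It is only an illustration and not something BigQuery runs; the helper name `presence_by_sample` and the constant `ROWS` are made up for this example.

```python
from collections import defaultdict

# Sample rows copied from the WITH data AS (...) block in the query above.
ROWS = [
    ("hu011C57", "chr3_3930069__TGT"),
    ("hu011C57", "chr3_3929921_TC"),
    ("hu011C57", "chr3_3929739_TC"),
    ("hu011C57", "chr3_3929813__T"),
    ("hu011C57", "chr3_3929897_GA"),
    ("hu011C57", "chr3_3929977_TC"),
    ("hu011C57", "chr3_3929783_TC"),
    ("hu011C58", "chr3_3929783_TC"),
    ("hu011C58", "chr3_3929921_TC"),
]


def presence_by_sample(rows):
    """Hypothetical helper: one entry per sample, listing (feature, present)."""
    # Equivalent of all_features: every distinct feature, sorted by name.
    all_features = sorted({feature for _, feature in rows})

    # Equivalent of aggregated_samples: the set of features seen per sample.
    per_sample = defaultdict(set)
    for sample_id, feature in rows:
        per_sample[sample_id].add(feature)

    # Equivalent of the final SELECT: pair each feature with a presence flag.
    return {
        sample_id: [(feature, feature in seen) for feature in all_features]
        for sample_id, seen in per_sample.items()
    }


if __name__ == "__main__":
    for sample_id, features in presence_by_sample(ROWS).items():
        print(sample_id, features)
```

Each sample ends up with the same, identically ordered list of features, so the flags line up across samples even though they are stored as an array rather than as separate columns.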
Answer 1 (score: 0)
You can achieve this by grouping on the sample ID:
SELECT
  sample_id,
  COUNTIF(PIK3CA_features = 'chr3_3930069__TGT') AS chr3_3930069__TGT,
  COUNTIF(PIK3CA_features = 'chr3_3929921_TC') AS chr3_3929921_TC,
  COUNTIF(PIK3CA_features = 'chr3_3929739_TC') AS chr3_3929739_TC
FROM `your_table`  -- placeholder: replace with your table name
GROUP BY sample_id;
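Assuming there are no duplicate PIK3CA_features per sample ID, this should give you what you need. If a feature can appear more than once for the same sample, COUNTIF returns the number of occurrences rather than 1.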
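With many distinct features, writing one column per feature by hand gets tedious, so one option is to generate the query text. The following is a minimal sketch under a few assumptions: the function name `build_pivot_query` and the table name in the usage line are hypothetical, the list of features is obtained separately (for example with SELECT DISTINCT PIK3CA_features), and it swaps COUNTIF for MAX(IF(..., 1, 0)) so the result stays 1/0 even if a feature is duplicated within a sample.

```python
import re

# Feature values are reused as column names, so they must look like plain
# identifiers: a letter or underscore followed by letters, digits, underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_pivot_query(features, table):
    """Hypothetical helper: build a one-column-per-feature pivot query."""
    columns = []
    for feature in features:
        if not _IDENTIFIER.fullmatch(feature):
            # Rejecting odd values also keeps quotes out of the string literal.
            raise ValueError(f"not usable as a column name: {feature!r}")
        columns.append(
            f"  MAX(IF(PIK3CA_features = '{feature}', 1, 0)) AS {feature}"
        )
    select_list = ",\n".join(["  sample_id"] + columns)
    return f"SELECT\n{select_list}\nFROM `{table}`\nGROUP BY sample_id"


if __name__ == "__main__":
    # The table name below is a placeholder, not a real table.
    print(build_pivot_query(
        ["chr3_3930069__TGT", "chr3_3929921_TC", "chr3_3929739_TC"],
        "my_project.my_dataset.my_table",
    ))
```

The generated text can then be pasted into the BigQuery console or passed to whatever client you already use. Keep in mind that a very wide result is subject to BigQuery's limit on the number of columns per table.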