BigQuery: finding a subsequence

Time: 2017-12-14 17:42:52

Tags: sql google-bigquery subsequence

Suppose my table is

WITH `sample_project.sample_dataset.table` AS (
  SELECT 'user1' user, 2 sequence, 'T1' ts UNION ALL
  SELECT 'user1', 2, 'T2' UNION ALL
  SELECT 'user1', 1, 'T3' UNION ALL
  SELECT 'user1', 1, 'T4' UNION ALL
  SELECT 'user1', 3, 'T5' UNION ALL
  SELECT 'user1', 2, 'T6' UNION ALL
  SELECT 'user1', 3, 'T7' UNION ALL
  SELECT 'user1', 3, 'T8' 
)
SELECT * FROM `sample_project.sample_dataset.table`
ORDER BY ts

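Can I find the integer subsequences available in the sequence column without using STRING_AGG and REGEX or JOIN operations? The goal is to make the query more efficient.

A subsequence is a part of a string. For example, take the string "banana": one sample subsequence is "anna", because the index of each character of "anna" within "banana" is strictly increasing. The characters of a subsequence do not need to be contiguous.

If I lay out the table above in increasing timestamp order, the STRING_AGG of the sequence column is 22113233. In the string 22113233 the subsequence 1 2 3 is available, while the subsequence 3 2 1 is not. Given a subsequence 213, how can I tell whether it is available (in 22113233, ordered by timestamp)?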

1 answer:

Answer 0 (score: 3)

Given a subsequence 213, how can I tell whether this subsequence is available (in 22113233 ...

The example below is for BigQuery Standard SQL

   
#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list 
), `subsequences` AS (
  SELECT '123' subsequence UNION ALL
  SELECT '321' UNION ALL
  SELECT '213'
)
SELECT sequence_list, subsequence, 
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE(subsequence, '', '.*')) AS available
FROM `sequences` l
CROSS JOIN `subsequences` s

The result is

sequence_list   subsequence     available    
22113233        321             false    
22113233        123             true     
22113233        213             true     
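
To see why the check works, it can help to look at the pattern that REGEXP_REPLACE builds. The sketch below is an illustration only: the extra `pattern` column is an addition that is not part of the answer. Because the empty pattern '' matches at every position between characters, '.*' should be inserted there, so '213' is matched as a 2, then anything, then a 1, then anything, then a 3.

```sql
#standardSQL
-- Illustration only: expose the generated pattern next to the check result.
WITH subsequences AS (
  SELECT '123' subsequence UNION ALL
  SELECT '321' UNION ALL
  SELECT '213'
)
SELECT subsequence,
  -- The empty pattern matches between characters, so '.*' is inserted there.
  REGEXP_REPLACE(subsequence, '', '.*') AS pattern,
  REGEXP_CONTAINS('22113233', REGEXP_REPLACE(subsequence, '', '.*')) AS available
FROM subsequences
```

This relies on the subsequence consisting of plain digits. If it could contain regular-expression metacharacters, they would need to be escaped first.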

If you are looking for one specific subsequence, this can be simplified further to

#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list UNION ALL
  SELECT '11223322'
)
SELECT sequence_list,  
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE('213', '', '.*')) available
FROM `sequences`

The result is

sequence_list   available    
22113233        true     
11223322        false
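
The examples above start from strings that are already concatenated. One possible way to get there from the table in the question is sketched below. It assumes that every value in the sequence column is a single digit and that ordering by ts gives the intended order. The intermediate `sequences` CTE is a name chosen here for illustration. Note that this still uses STRING_AGG and a regular expression, which the question hoped to avoid.

```sql
#standardSQL
WITH `sample_project.sample_dataset.table` AS (
  SELECT 'user1' user, 2 sequence, 'T1' ts UNION ALL
  SELECT 'user1', 2, 'T2' UNION ALL
  SELECT 'user1', 1, 'T3' UNION ALL
  SELECT 'user1', 1, 'T4' UNION ALL
  SELECT 'user1', 3, 'T5' UNION ALL
  SELECT 'user1', 2, 'T6' UNION ALL
  SELECT 'user1', 3, 'T7' UNION ALL
  SELECT 'user1', 3, 'T8'
), sequences AS (
  -- Concatenate each user's values in timestamp order (22113233 for user1).
  SELECT user, STRING_AGG(CAST(sequence AS STRING), '' ORDER BY ts) AS sequence_list
  FROM `sample_project.sample_dataset.table`
  GROUP BY user
)
SELECT user, sequence_list,
  -- Same check as in the answer, applied to the subsequence 213.
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE('213', '', '.*')) AS available
FROM sequences
```

For the sample data this should return one row for user1 with sequence_list 22113233 and available true. With multi-digit values the concatenation becomes ambiguous, so a delimiter and a correspondingly adjusted pattern would be needed.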