Remove words from a string based on matches in an array

Time: 2018-07-03 22:48:11

Tags: sql google-bigquery

I have this table:

Big Query Table

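Assume "florio" is a city that appears somewhere in the AllLocationTerms array column.

How can I remove "florio" from the query when it is present in the list of locations in the AllLocationTerms array column?

Basically, I want to remove from the query column every term that matches an entry in AllLocationTerms.

More than one word may match, for example "new york apartments" as the query and "new", "york" in the array. In that case the result should be "apartments".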

2 answers:

Answer 0: (score: 2)

Below is for BigQuery Standard SQL

    
#standardSQL
SELECT *, 
  REGEXP_REPLACE(query, 
    (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
  '') modified_query
FROM `project.dataset.table`   

You can test this with dummy data, as below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms 
)
SELECT *, 
  REGEXP_REPLACE(query, 
    (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
  '') modified_query
FROM `project.dataset.table`    

The result is

Row query                              clicks   allLocationTerms    modified_query   
1   florio management apartments            1   battle              management apartments    
                                                creek        
                                                iowa         
                                                florio       
2   florio creek management iowa apartments 1   battle              management apartments    
                                                creek        
                                                iowa         
                                                florio       
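
One thing to watch with the query above: REGEXP_REPLACE removes only the matched terms, so the spaces around them stay behind (for example, a leading space once 'florio' is removed from the start of the string). A minimal sketch of one way to tidy that up, keeping the same pattern and wrapping it in a whitespace collapse plus TRIM. The sample row is made up from the "new york apartments" case in the question, and the sketch assumes the terms contain no regex metacharacters:

```sql
#standardSQL
WITH `project.dataset.table` AS (
  -- Hypothetical sample row based on the question's example
  SELECT 'new york apartments' query, ['new','york'] allLocationTerms
)
SELECT *,
  TRIM(                                  -- drop leading/trailing spaces
    REGEXP_REPLACE(                      -- collapse runs of whitespace to one space
      REGEXP_REPLACE(query,              -- remove whole-word matches of any term
        (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
      ''),
    r'\s+', ' ')
  ) modified_query
FROM `project.dataset.table`
```

For this row the inner replace leaves two spaces followed by 'apartments', and the outer steps reduce that to 'apartments'.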

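Answer 1: (score: 1)

Below is a solution for your use case, where you need to check 800,000 rows against a location-list array of about 40k items.
With 40K items, building a regular expression as in my previous answer is definitely not workable.
To get around this, I recommend splitting the query string into separate words while keeping each word's position, then excluding the words that appear among the terms via a LEFT JOIN, and finally combining the remaining words back into a string.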

    
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms 
)
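SELECT *,
  (
    -- Reassemble the surviving words in their original order.
    SELECT STRING_AGG(word, ' ' ORDER BY pos) 
    FROM (
      -- Split query into words, remembering each word's position.
      SELECT word, MIN(pos) pos 
      FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
      LEFT JOIN UNNEST(allLocationTerms) term 
      ON word = term
      GROUP BY word
      -- Keep only words that matched no location term.
      HAVING COUNT(DISTINCT term) = 0
    )
  ) modified_query
FROM `project.dataset.table`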
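
A variation on the same split-and-filter idea, offered as a sketch rather than a measured replacement. Because the query above groups by word, a non-location word that occurs twice in query is emitted only once. If repeated words should be kept, filtering with NOT IN UNNEST avoids the grouping step. The sample rows are made up for illustration, and I have not compared its performance against the LEFT JOIN version on a 40k-term array:

```sql
#standardSQL
WITH `project.dataset.table` AS (
  -- Hypothetical sample rows; the second one repeats a non-location word
  SELECT 'new york apartments' query, ['new','york'] allLocationTerms UNION ALL
  SELECT 'apartments near apartments florio' query, ['florio'] allLocationTerms
)
SELECT *,
  (
    -- Keep every word that is not a location term, in its original position
    SELECT STRING_AGG(word, ' ' ORDER BY pos)
    FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
    WHERE word NOT IN UNNEST(allLocationTerms)
  ) modified_query
FROM `project.dataset.table`
```

Here the second row keeps both occurrences of 'apartments', whereas the grouped version would return 'apartments near'. Note that in either version modified_query is NULL when every word is removed, since STRING_AGG has no rows to aggregate.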