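I have this table, with a query column and an AllLocationTerms array column:
Assume "florio" is a city that appears among the locations in the AllLocationTerms array column.
How can I remove "florio" from the query when it exists in the list of locations in the AllLocationTerms array column?
Basically, I want to remove from the query column every term that matches an entry in AllLocationTerms.
Two or more words may match. For example, the query could be "new york apartments" with "new" and "york" in the array. In that case the result should be "apartments".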
Answer 0 (score: 2)
The following is for BigQuery Standard SQL:
#standardSQL
SELECT *,
  REGEXP_REPLACE(query,
    (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
    '') modified_query
FROM `project.dataset.table`
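To see what pattern the subquery builds, it can help to run it on its own. The following is a minimal sketch that assumes a literal array in place of the allLocationTerms column. The ORDER BY on the offset is an addition here, included only so the terms come out in array order.

```sql
#standardSQL
-- Build the alternation pattern from a literal array (a stand-in for allLocationTerms).
-- Each term is wrapped in \b word boundaries so only whole words match.
SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b' ORDER BY pos), '\\b') AS pattern
FROM UNNEST(['battle', 'creek', 'iowa', 'florio']) AS term WITH OFFSET AS pos
-- pattern: \bbattle\b|\bcreek\b|\biowa\b|\bflorio\b
```

The terms are inserted into the pattern verbatim, so a term containing regular-expression metacharacters would be interpreted as pattern syntax rather than as literal text.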
You can test and play with the above using dummy data, as follows:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms
)
SELECT *,
  REGEXP_REPLACE(query,
    (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b') FROM UNNEST(allLocationTerms) term),
    '') modified_query
FROM `project.dataset.table`
The result is:
Row  query                                    clicks  allLocationTerms  modified_query
1    florio management apartments             1       battle            management apartments
                                                      creek
                                                      iowa
                                                      florio
2    florio creek management iowa apartments  1       battle            management apartments
                                                      creek
                                                      iowa
                                                      florio
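Because each matched term is replaced with an empty string, the spaces around it remain in modified_query. For the first row, the actual value is ' management apartments', with a leading space. Below is a minimal sketch of one way to tidy that up, assuming single-space-separated output is wanted: collapse runs of whitespace, then trim the ends. The sample_queries row is hypothetical and mirrors the "new york apartments" case from the question.

```sql
#standardSQL
WITH sample_queries AS (
  -- Hypothetical sample row, not from the original table.
  SELECT 'new york apartments' AS query, ['new', 'york'] AS allLocationTerms
)
SELECT
  query,
  TRIM(                          -- 3. drop leading/trailing spaces
    REGEXP_REPLACE(              -- 2. collapse repeated whitespace into one space
      REGEXP_REPLACE(            -- 1. remove every whole-word match of a term
        query,
        (SELECT CONCAT('\\b', STRING_AGG(term, '\\b|\\b'), '\\b')
         FROM UNNEST(allLocationTerms) AS term),
        ''),
      r'\s+', ' ')
  ) AS modified_query
FROM sample_queries
-- modified_query: apartments
```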
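Answer 1 (score: 1)
Below is a solution for your use case, where you need to check 800,000 rows against a location-term array containing about 40K items.
With 40K items, building a regular expression as in my previous answer is definitely not going to work.
To address this, I suggest splitting the query string into separate words while preserving each word's position, then excluding the words that match a term via a LEFT JOIN, and finally combining the remaining words back into a string: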
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'florio management apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms UNION ALL
  SELECT 'florio creek management iowa apartments' query, 1 clicks, ['battle','creek','iowa','florio'] allLocationTerms
)
SELECT *,
  (
    SELECT STRING_AGG(word, ' ' ORDER BY pos)
    FROM (
      SELECT word, MIN(pos) pos
      FROM UNNEST(SPLIT(query, ' ')) word WITH OFFSET AS pos
      LEFT JOIN UNNEST(allLocationTerms) term
      ON word = term
      GROUP BY word
      HAVING COUNT(DISTINCT term) = 0
    )
  ) modified_query
FROM `project.dataset.table`
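One thing to be aware of: because the inner query groups by word, a word that appears more than once in the query string is kept only once, at its first position. If repeated words should be preserved, a possible variant is to filter each word directly with NOT IN UNNEST(...). This is only a sketch; it has not been benchmarked against a 40K-term array, and the sample_queries row is hypothetical.

```sql
#standardSQL
WITH sample_queries AS (
  -- Hypothetical sample row with a repeated non-location word.
  SELECT 'new york apartments cheap apartments' AS query,
         ['new', 'york'] AS allLocationTerms
)
SELECT
  query,
  (
    -- Split into words, keep each word's position, drop words that are terms,
    -- then join the survivors back together in their original order.
    SELECT STRING_AGG(word, ' ' ORDER BY pos)
    FROM UNNEST(SPLIT(query, ' ')) AS word WITH OFFSET AS pos
    WHERE word NOT IN UNNEST(allLocationTerms)
  ) AS modified_query
FROM sample_queries
-- modified_query: apartments cheap apartments
```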