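We want to run a query that returns two-word phrases that occur in more than one row. Take the string "Data Ninja", for example: because it appears in multiple rows of our dataset, the query should return it. The query should find every such phrase by looking at each pair of adjacent words (which together form a phrase) in each row of the dataset. The adjacent word pairs must come from the dataset we loaded into BigQuery, not from a predefined list.
How can we write this query in Google BigQuery?
The dataset is simply a long list of English sentences.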
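Answer 0 (score: 4)
Good news: BigQuery now supports SPLIT(). See https://stackoverflow.com/a/24172995/132438.
This is a hack, but I happen to like hacks :).
In its current form it only works on sentences with more than two words, and it extracts only the first six pairs. You can extend and test it from here.
Try it with your data and report back.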
SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000
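The results may contain NSFW words: the sample dataset comes from an online community and has not been cleaned. If you are sensitive to certain words, do not run the query.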
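To see what each sub-select contributes, the regex trick can be replayed locally. The following is a minimal Python sketch, not BigQuery code: `nth_pair` and `repeated_pairs` are hypothetical helper names, and the sample titles are invented for illustration. It assumes Python's `re.sub` behaves like `REGEXP_REPLACE` for these patterns.

```python
import re
from collections import Counter

def nth_pair(title, k):
    """Mimic one sub-select: extract the (k+1)-th adjacent word pair.

    Returns None when the regex leaves the title unchanged, which is
    the case the query discards with `WHERE pairs != title`.
    """
    pattern = r'([^\s]+ ){%d}([^\s]* [^\s]+).*' % k
    result = re.sub(pattern, r'\2', title)
    return None if result == title else result

def repeated_pairs(titles, max_pairs=6):
    """Count pairs over all titles and keep those seen more than once."""
    counts = Counter()
    for title in titles:
        for k in range(max_pairs):  # one pass per sub-select in the query
            pair = nth_pair(title, k)
            if pair is not None:
                counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c > 1}

# Invented sample data, standing in for the title column.
titles = ["data ninja rocks", "hire a data ninja today", "data ninja"]
print(repeated_pairs(titles))
```

Note that the two-word title "data ninja" contributes nothing: its only pair equals the whole title, so the `pairs != title` filter drops it. That is why the query only works on sentences with more than two words.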
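Answer 1 (score: 3)
A very useful hack; it inspired the solution to my problem, thank you.
My data consists of passenger counts and age combinations, where ages is a string of comma-separated numbers: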
adults  ages
------  -------------
4       "53,67,65,68"
4       "44,45,69,65"
3       "20,21,20"
3       "30,32,62"
I want to add a column to each row containing the difference between the highest and lowest age:
adults  ages           agediff
------  -------------  -------
4       "53,67,65,68"  15
4       "44,45,69,65"  25
3       "20,21,20"     1
3       "30,32,62"     32
This is done with the following query, heavily inspired by the hack:
SELECT adults, ages, SUBTRACT(INTEGER(maxage), INTEGER(minage)) agediff FROM
(SELECT adults, ages, MAX(age) maxage, MIN(age) minage FROM
  (SELECT adults, ages, age FROM
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="3")),
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d,(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="3")),
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d,\d\d,(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="3"))
  ),
  (SELECT adults, ages, age FROM
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="4")),
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d,(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="4")),
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d,\d\d,(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="4")),
    (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d,\d\d,\d\d,(\d\d)') age FROM [PaxAgeCombinations] WHERE (adults="4"))
  )
GROUP BY adults, ages
)
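The expected agediff values can be checked outside BigQuery. This is a minimal Python sketch, with `age_diff` as a hypothetical helper name. Unlike the query, it converts each age to an integer before taking the maximum and minimum. The query compares the extracted ages as strings, which gives the right answer only because every age here has exactly two digits.

```python
def age_diff(ages):
    """Difference between the highest and lowest age in a comma-separated string."""
    values = [int(a) for a in ages.split(",")]  # compare as numbers, not strings
    return max(values) - min(values)

# The four sample rows from the table above.
rows = [
    (4, "53,67,65,68"),
    (4, "44,45,69,65"),
    (3, "20,21,20"),
    (3, "30,32,62"),
]
for adults, ages in rows:
    print(adults, ages, age_diff(ages))
```

Because it splits on commas instead of matching fixed positions, the sketch does not depend on the number of passengers or on the number of digits per age.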