查找数据集中多行中出现的所有两个单词短语

时间:2013-09-10 01:46:02

标签: data-mining bigdata google-bigquery data-analysis n-gram

我们希望运行一个返回两个出现在多行中的单词短语的查询。因此对于例如取字符串“Data Ninja”。由于它出现在我们的数据集中的多个行中,因此查询应返回该行。查询应通过在数据集中的行中查询两个相邻的单词组合(形成短语)来查找数据集中所有行的所有此类短语。这两个相邻的单词组合应该来自我们加载到BigQuery的数据集

我们如何在Google BigQuery中编写此查询?

数据集只是一长串英文句子。

2 个答案:

答案 0 :(得分:4)

好消息:BigQuery现在支持SPLIT()。检查https://stackoverflow.com/a/24172995/132438


这是一个黑客,但我碰巧喜欢黑客攻击:)。

在目前的形式中,它仅适用于超过2个单词的句子,并且它仅提取6个第一对。你可以从这里扩展和测试。

尝试使用您的数据,然后向您报告。

SELECT pairs, COUNT(*) c FROM
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){0}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){1}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){2}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){3}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){4}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
),
(
SELECT REGEXP_REPLACE(title, '([^\\s]+ ){5}([^\\s]* [^\\s]+).*', '\\2') pairs, title
FROM [bigquery-samples:reddit.full]
)
WHERE pairs != title
GROUP EACH BY pairs
HAVING c > 1
LIMIT 1000

结果可能包含NSFW字样。样本数据集来自尚未清理的在线社区"。如果您对某些单词敏感,请不要运行查询。

答案 1 :(得分:3)

一个非常有用的黑客,它激励我解决我的问题,谢谢。

我的数据是乘客和年龄的组合,其中年龄是一串数字:

adults ages
------ -------------
  4    "53,67,65,68"       
  4    "44,45,69,65" 
  3    "20,21,20"
  3    "30,32,62"

我想在每一行添加一列,其中包含最高和最低值之间的年龄差异

adults ages          agediff
------ ------------- -------
   4   "53,67,65,68" 15       
   4   "44,45,69,65" 25
   3   "20,21,20"    1
   3   "30,32,62"    32

这是由以下人员完成的,深受黑客的启发:

SELECT adults, ages, SUBTRACT(INTEGER(maxage),INTEGER(minage)) agediff FROM 
 (SELECT adults, ages, max(age) maxage, min(age) minage FROM
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="3"))
  ),
  (SELECT adults, ages, age FROM 
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4")),
   (SELECT adults, ages, REGEXP_EXTRACT(ages, r'\d\d\,\d\d\,\d\d\,([\d\d\,]{2})') age FROM [PaxAgeCombinations] WHERE (adults="4"))
  )