How do I write a subquery in BigQuery?

Time: 2016-02-22 13:32:47

Tags: sql subquery google-bigquery reddit bigdata

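I'm trying to work with reddit data in BigQuery, and I want to see a comment and its reply in a single row. I see that BigQuery supports subqueries, but I can't construct the query. Because of how the data is structured, I have to self-join the table using a subquery. Specifically, I want to join id to parent_id, but I need to modify id before joining. Here is how I tried to write the query: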

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p,
(
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
WHERE  p.first_id = c.parent_id  
AND p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]';

The error I get is:

Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?

You can run the query here: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01

I don't know how to fix this. What is the correct way to write this join and run the query?

2 answers:

Answer 0 (score: 3)

You probably want something like the following (just a guess):

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.link_id = c.parent_id  
WHERE p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]'
LIMIT 100

Read more about JOINs.

Note that I only converted your query to use JOIN correctly; the query logic itself is still yours to adjust as needed.

  

Added to address the additional information given in the comments:

SELECT 
  subreddit, 
  first_body,
  first_score,
  first_id ,
  last_body,
  last_score,
  last_id 
FROM (
  SELECT 
    subreddit, 
    body AS first_body,
    score AS first_score,
    CONCAT('t1_',id) AS first_id 
  FROM [fh-bigquery:reddit_comments.2016_01]
  WHERE score > 1 
  AND author != '[deleted]' 
  AND body != '[deleted]'
) p
JOIN (
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
ON  p.first_id = c.parent_id  
LIMIT 100  
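
To see the shape of this self-join on data small enough to check by hand, here is a minimal sketch that runs the same pattern against an in-memory SQLite database from Python. Everything in it is an assumption made for illustration: the `comments` table and its four rows are made up, SQLite stands in for BigQuery, and `||` replaces `CONCAT` because that is how SQLite concatenates strings. It is not the BigQuery query itself.

```python
import sqlite3

# Hypothetical toy data standing in for the reddit comments table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments "
    "(id TEXT, parent_id TEXT, body TEXT, score INTEGER, author TEXT)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "t3_post", "first comment", 5, "alice"),
        ("b2", "t1_a1", "a reply", 3, "bob"),
        ("c3", "t1_a1", "[deleted]", 4, "[deleted]"),
        ("d4", "t1_b2", "low score reply", 1, "carol"),
    ],
)

# Both sides are subqueries: p builds the prefixed id, c keeps parent_id,
# and the join condition compares the two.
query = """
SELECT p.first_body, c.last_body
FROM (
  SELECT body AS first_body, 't1_' || id AS first_id
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) AS p
JOIN (
  SELECT body AS last_body, parent_id
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) AS c
ON p.first_id = c.parent_id
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The key point is that `first_id` is computed inside the subquery, so it exists as a real column of `p` by the time the `ON` clause refers to it.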

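Answer 1 (score: 0)

In BigQuery's legacy SQL dialect (the one used here, with bracketed table names), a comma between tables means UNION ALL rather than JOIN. You need to write the join explicitly with the JOIN keyword.

I would also recommend pushing both sides of the join into subqueries, so that all filters are applied before the join is performed. (The join is currently the most expensive part of the query, so applying the filters first helps the query run as fast as possible.)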
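
The filter-first advice can be sketched on toy data as follows. This is only an illustration under stated assumptions: the `comments` table and its rows are hypothetical, and SQLite (via Python's `sqlite3` module) stands in for BigQuery, so it shows that both forms return the same rows and how many rows survive the filter, not how BigQuery actually executes or prices either form.

```python
import sqlite3

# Hypothetical toy data; SQLite is used only so the example is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id TEXT, parent_id TEXT, body TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        ("a1", "t3_post", "top comment", 10),
        ("b2", "t1_a1", "good reply", 4),
        ("c3", "t1_a1", "ignored reply", 0),
        ("d4", "t1_c3", "reply to ignored", 7),
    ],
)

# Filters written after the join, in the outer WHERE clause.
filter_after = """
SELECT p.body, c.body
FROM comments AS p
JOIN comments AS c ON 't1_' || p.id = c.parent_id
WHERE p.score > 1 AND c.score > 1
"""

# Filters pushed into subqueries on both sides of the join.
filter_before = """
SELECT p.body, c.body
FROM (SELECT id, body FROM comments WHERE score > 1) AS p
JOIN (SELECT parent_id, body FROM comments WHERE score > 1) AS c
ON 't1_' || p.id = c.parent_id
"""

after_rows = sorted(conn.execute(filter_after).fetchall())
before_rows = sorted(conn.execute(filter_before).fetchall())

# Rows per side that reach the join once the filter is applied first.
join_input = conn.execute(
    "SELECT COUNT(*) FROM comments WHERE score > 1"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(after_rows == before_rows, join_input, total)
```

Both forms produce the same result here; the difference is that the subquery version makes it explicit that only the filtered rows on each side take part in the join.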