Joining two tables on BigQuery using the GitHub dataset

Date: 2017-02-15 16:39:10

Tags: google-bigquery github-api

My goal is to join sample_contents with sample_commits on repo_name.

1) First, I joined sample_contents with files, so that the result now includes repo_name:

SELECT line, a.id, sample_path, sample_repo_name, repo_name
FROM (
  SELECT *
  FROM (
    SELECT (SPLIT(content, '\n')) line, a.id, sample_path, sample_repo_name, repo_name
    FROM (
      (SELECT * FROM [bigquery-public-data:github_repos.sample_contents] WHERE sample_path LIKE '%.java')
    ) a
    JOIN (SELECT * FROM [bigquery-public-data:github_repos.files]) b
    ON a.id = b.id
  )
  WHERE REGEXP_MATCH(line, '^String|^private int|^public|[.]')
)

2) Next, I ran the following query, expecting to get all commits for any given file via repo_name:

SELECT
  (CASE WHEN line CONTAINS 'String' THEN 'String' ELSE '' END) AS column_1,
  (CASE WHEN line CONTAINS 'public' THEN 'public' ELSE '' END) AS column_2,
  line, a.id, sample_path, sample_repo_name, X.repo_name
FROM (
  SELECT *
  FROM (
    SELECT (SPLIT(content, '\n')) line, a.id, sample_path, sample_repo_name, repo_name
    FROM (
      (SELECT * FROM [bigquery-public-data:github_repos.sample_contents] WHERE sample_path LIKE '%.java')
    ) a
    JOIN (SELECT * FROM [bigquery-public-data:github_repos.files]) b
    ON a.id = b.id
  )
  WHERE REGEXP_MATCH(line, '^String|^private int|^public|[.]')
) X
JOIN (SELECT * FROM [bigquery-public-data:github_repos.sample_commits]) Y
ON X.repo_name = Y.repo_name
LIMIT 100

But it returns 0 results! Could you help me figure out what is wrong?

Thanks,

1 answer:

Answer 0 (score: 0)

I'm not sure what you are trying to do here, but a good place to check is the "Explanation" tab.

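Note that in the early stages the query processes about 2 billion rows in stage 2 and about 1.7 billion rows in stage 7. Somehow, stage 37 turns this into 10 billion rows (an exploding JOIN, or SPLIT()?).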
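To illustrate how a JOIN can "explode" the row count, here is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery. The tables file_lines and commits and all of their rows are hypothetical stand-ins; only the join key repo_name comes from the question. When several rows on each side share the same key, the join emits one row per pairing.

```python
import sqlite3

# Toy stand-ins for the two join sides; every row here is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_lines (repo_name TEXT, line TEXT);
    CREATE TABLE commits (repo_name TEXT, commit_id TEXT);
""")

# 3 source lines and 4 commits all share the same repo_name key.
conn.executemany(
    "INSERT INTO file_lines VALUES (?, ?)",
    [("demo/repo", f"line {i}") for i in range(3)],
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [("demo/repo", f"c{i}") for i in range(4)],
)

# Every line pairs with every commit of the same repo: 3 * 4 = 12 rows.
joined_rows = conn.execute("""
    SELECT COUNT(*)
    FROM file_lines AS x
    JOIN commits AS y ON x.repo_name = y.repo_name
""").fetchone()[0]

print(joined_rows)  # 12
```

The same multiplication applies at scale: splitting file contents into lines and then joining on a repository-level key pairs every line with every commit of that repository.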

Look at what happens in stage 37: none of those 10 billion rows make it through the filter (the WHERE clause or the equality JOIN).
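One way to investigate an equality JOIN that lets no rows through is to count how many distinct key values the two sides actually share. The sketch below does this on a toy SQLite database via Python's built-in sqlite3 module. The tables and rows are hypothetical and deliberately built with no shared repo_name values; it does not claim that this is what happens in the real dataset, only how such a check could look.

```python
import sqlite3

# Toy data, assumed for illustration: the two sides share no repo_name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id TEXT, repo_name TEXT);
    CREATE TABLE commits (commit_id TEXT, repo_name TEXT);
    INSERT INTO files VALUES
        ('f1', 'alice/app'), ('f2', 'alice/app'), ('f3', 'bob/lib');
    INSERT INTO commits VALUES
        ('c1', 'carol/tool'), ('c2', 'carol/tool');
""")


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Distinct join keys on each side, and how many of them appear on both.
left_keys = scalar("SELECT COUNT(DISTINCT repo_name) FROM files")
right_keys = scalar("SELECT COUNT(DISTINCT repo_name) FROM commits")
shared_keys = scalar("""
    SELECT COUNT(*)
    FROM (SELECT DISTINCT repo_name FROM files) AS f
    JOIN (SELECT DISTINCT repo_name FROM commits) AS c
      ON f.repo_name = c.repo_name
""")

print(left_keys, right_keys, shared_keys)  # 2 1 0
```

If the shared count is 0, the join cannot return rows no matter how many rows flow into it. The same three counts can be computed in BigQuery against the real tables before running the full query.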