Google BigQuery: nested select subquery in a cross join

Time: 2015-05-20 11:11:57

Tags: subquery google-bigquery cross-join

I have the following code:

SELECT ta.application as koekkoek, ta.ipc, ipc_count/ipc_tot as ipc_share, t3.sfields
FROM (
  select t1.appln_id as application, t1.ipc_subclass_symbol as ipc,
         count(t2.appln_id) as ipc_count,
         sum(ipc_count) over (PARTITION BY application) as ipc_tot
  FROM temp.tls209_small t1
  CROSS JOIN (SELECT appln_id, FROM temp.tls209_small group by appln_id ) t2
  where t1.appln_id = t2.appln_id
  GROUP BY application, ipc
) as ta
CROSS JOIN thesis.ifris_ipc_concordance t3
WHERE ta.ipc LIKE t3.ipc+'%'
  AND ta.ipc NOT LIKE t3.not_ipc+'%'
  AND t3.not_appln_id NOT IN (SELECT ipc_subclass_symbol from temp.tls209_small t5 where t5.appln_id = ta.application)

It gives the following error:

Field 'ta.application' not found.

I have tried several notations for the field, but BigQuery does not seem to recognize any reference to the other tables from inside the subquery.

The purpose of the code is to assign new technology classifications to records based on a concordance table.

I have two tables:

One large table with application IDs, classifications and some other things: tls209_small

An index table with some exception rules: ifris_ipc_concordance

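In the end, I need to assign an sfields label to every row in tls209 (300 million rows). The rule is that ipc_class_symbol from the first table should be LIKE ipc+'%' from the second table, but not like not_ipc. In addition, the not_appln_id value, if present, must not be associated with the same appln_id in the first table.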

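Here is a small example. Say this is the input for the query:

appln_id | ipc_class_symbol
1 | A1
1 | A2
1 | A3
1 | C3

sfields | ipc | not_ipc | not_appln_id
X | A | A2 | null
Y | A | null | A3

appln_id 1 should get sfields X twice, because the rule with ipc = A and not_ipc = A2 matches A1 and A3. Y should not be assigned, because A3 occurs in appln_id 1.

In the results I also need the share of each ipc_class_symbol within a single application (1 for 328100001, 0.5 for 32100009, etc.).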
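To make the rule concrete, here is a minimal Python sketch that applies it to the toy rows above. It is only an interpretation of the description in this question: the function name `assign_sfields` and the tuple layout are hypothetical, prefix matching stands in for `LIKE ...+'%'`, and the share is computed as the row count of one class divided by the row count of its application.

```python
from collections import Counter

# Toy rows copied from the example in the question.
applications = [(1, "A1"), (1, "A2"), (1, "A3"), (1, "C3")]
# (sfields, ipc, not_ipc, not_appln_id); None stands for null.
concordance = [("X", "A", "A2", None), ("Y", "A", None, "A3")]


def assign_sfields(applications, concordance):
    """Return (appln_id, ipc_class_symbol, share, sfields) rows."""
    pair_counts = Counter(applications)           # rows per (appln_id, symbol)
    totals = Counter(a for a, _ in applications)  # rows per appln_id
    classes = {}                                  # all symbols per appln_id
    for appln_id, symbol in applications:
        classes.setdefault(appln_id, set()).add(symbol)

    rows = []
    for (appln_id, symbol), n in sorted(pair_counts.items()):
        share = n / totals[appln_id]
        for sfields, ipc, not_ipc, not_appln_id in concordance:
            if not symbol.startswith(ipc):
                continue  # must match ipc + '%'
            if not_ipc is not None and symbol.startswith(not_ipc):
                continue  # must not match not_ipc + '%'
            if not_appln_id is not None and not_appln_id in classes[appln_id]:
                continue  # excluded class occurs in the same application
            rows.append((appln_id, symbol, share, sfields))
    return rows


result = assign_sfields(applications, concordance)
print(result)
```

For the toy input this yields X for A1 and A3, each with a share of 0.25, and never Y, which is the outcome described above.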

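Without the last condition (the not_appln_id check), the query works fine: results

Any suggestions on how to get the subquery to recognize the application ID (ta.application), or on other ways to introduce the last condition into the query?

I realize my explanation of the problem may not be very straightforward, so if anything is unclear, please say so and I will try to clarify.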

2 Answers:

Answer 0: (score: 1)

The query you are running performs an anti-join. You can rewrite it as an explicit join, but it is a bit verbose.
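As an illustration of the anti-join idea, here is a minimal sketch that expresses "not_appln_id must not occur in the same application" as a LEFT JOIN followed by an IS NULL test. It is not the answerer's query and has not been checked against BigQuery: it runs on SQLite through Python's `sqlite3` module, the table names `appln_ipc` and `concordance` are hypothetical, and the data is the toy example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE appln_ipc (appln_id INTEGER, ipc_class_symbol TEXT);
    CREATE TABLE concordance (sfields TEXT, ipc TEXT, not_ipc TEXT, not_appln_id TEXT);
    INSERT INTO appln_ipc VALUES (1, 'A1'), (1, 'A2'), (1, 'A3'), (1, 'C3');
    INSERT INTO concordance VALUES ('X', 'A', 'A2', NULL), ('Y', 'A', NULL, 'A3');
""")

ANTI_JOIN = """
    SELECT a.appln_id, a.ipc_class_symbol, c.sfields
    FROM appln_ipc AS a
    CROSS JOIN concordance AS c
    -- x finds a row of the same application that carries the excluded class
    LEFT JOIN appln_ipc AS x
           ON x.appln_id = a.appln_id
          AND x.ipc_class_symbol = c.not_appln_id
    WHERE a.ipc_class_symbol LIKE c.ipc || '%'
      -- the IS NULL guard keeps rules that have no not_ipc value
      AND (c.not_ipc IS NULL OR a.ipc_class_symbol NOT LIKE c.not_ipc || '%')
      -- anti-join: keep the row only when no excluded class was found
      AND x.appln_id IS NULL
    ORDER BY a.ipc_class_symbol, c.sfields
"""

anti_join_rows = conn.execute(ANTI_JOIN).fetchall()
print(anti_join_rows)
```

The join condition uses only equality comparisons, and the filter on `x.appln_id IS NULL` replaces the correlated `NOT IN` subquery, so no reference to `ta.application` from inside a subquery is needed.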

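Answer 1: (score: 1)

I reached a working solution by first generating a table in which ipc_class_symbol from the first table is matched only against the ipc column of the second table, while also carrying along the not_ipc and not_appln_id columns from the second table. In addition, a list of all ipc class labels assigned to each appln_id was added using the GROUP_CONCAT method.

Finally, with the help of Pentium10, the resulting table was filtered according to the exception rules, which is also discussed in this question.

In the final query, the GROUP BY and JOIN clauses need the EACH modifier to allow processing of large tables: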

SELECT application as appln_id, ipc as ipc_class, ipc_share, sfields as ifris_class FROM (
  SELECT * FROM (
    SELECT ta.application as application, ta.ipc as ipc, ipc_count/ipc_tot as ipc_share,
           t3.sfields as sfields, t3.ipc as yes_ipc, t3.not_ipc as not_ipc,
           t3.not_appln_id as exclude, t4.classes as other_classes
    FROM (
      SELECT t1.appln_id as application, t1.ipc_class_symbol as ipc,
             count(t2.appln_id) as ipc_count,
             sum(ipc_count) over (PARTITION BY application) as ipc_tot
      FROM thesis.tls209_appln_ipc t1
      FULL OUTER JOIN EACH
        (SELECT appln_id, FROM thesis.tls209_appln_ipc GROUP EACH BY appln_id) t2
      ON t1.appln_id = t2.appln_id
      GROUP EACH BY application, ipc
    ) AS ta
    LEFT JOIN EACH (
      SELECT appln_id, GROUP_CONCAT(ipc_class_symbol) as classes
      FROM [thesis.tls209_appln_ipc]
      GROUP EACH BY appln_id
    ) t4
    ON ta.application = t4.appln_id
    CROSS JOIN thesis.ifris_ipc_concordance t3
    WHERE ta.ipc CONTAINS t3.ipc
  ) as tx
  WHERE (not ipc contains not_ipc or not_ipc is null)
  AND (not other_classes contains exclude or exclude is null or other_classes is null)
)
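The two-step idea in this answer (build candidate matches with a concatenated class list per application, then filter on the exception columns) can be tried on the toy data from the question. The following is only a sketch under assumptions: it uses SQLite through Python's `sqlite3` rather than BigQuery, `instr(...) > 0` stands in for the legacy `CONTAINS` operator, the table names `appln_ipc` and `concordance` are hypothetical, and the share columns are left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE appln_ipc (appln_id INTEGER, ipc_class_symbol TEXT);
    CREATE TABLE concordance (sfields TEXT, ipc TEXT, not_ipc TEXT, not_appln_id TEXT);
    INSERT INTO appln_ipc VALUES (1, 'A1'), (1, 'A2'), (1, 'A3'), (1, 'C3');
    INSERT INTO concordance VALUES ('X', 'A', 'A2', NULL), ('Y', 'A', NULL, 'A3');
""")

TWO_STEP = """
    SELECT tx.appln_id, tx.ipc, tx.sfields
    FROM (
        -- step 1: candidate matches plus the class list of each application
        SELECT a.appln_id, a.ipc_class_symbol AS ipc, c.sfields,
               c.not_ipc, c.not_appln_id AS exclude, g.classes AS other_classes
        FROM appln_ipc AS a
        JOIN (SELECT appln_id, GROUP_CONCAT(ipc_class_symbol) AS classes
              FROM appln_ipc GROUP BY appln_id) AS g
          ON g.appln_id = a.appln_id
        CROSS JOIN concordance AS c
        WHERE instr(a.ipc_class_symbol, c.ipc) > 0
    ) AS tx
    -- step 2: apply the exception rules as substring tests
    WHERE (tx.not_ipc IS NULL OR instr(tx.ipc, tx.not_ipc) = 0)
      AND (tx.exclude IS NULL OR instr(tx.other_classes, tx.exclude) = 0)
    ORDER BY tx.ipc, tx.sfields
"""

filtered_rows = db.execute(TWO_STEP).fetchall()
print(filtered_rows)
```

One design point to keep in mind: a substring test against a concatenated list also matches when the excluded symbol is merely a prefix or fragment of another symbol in the list (for example, an exclusion value of A3 would be found inside A31), so this approach relies on the class symbols not overlapping in that way.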