How to optimize MySQL queries with joins and subqueries on large datasets (millions of rows)

Time: 2015-04-20 13:02:45

Tags: mysql group-by subquery left-join large-data

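I am trying to join four large tables (35 to 200 million rows each) from the international patent database PATSTAT to get the top 15 most-cited patents that meet certain requirements.

The first table (t9) lists citations from one application group (family) to another. Another table (t1) essentially links everything together, because it contains the family and application IDs as well as the filing year. Tables t2 and tls209_appln_ipc are used to identify which appln_id values to include.

The code I ended up with is as follows: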

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9 
LEFT JOIN 
(SELECT
t1.appln_id, t1.docdb_family_id from tls201_appln t1
LEFT JOIN tls204_appln_prior t2 on t1.appln_id=t2.appln_id 
WHERE
t1.appln_filing_year BETWEEN 2010 AND 2015
AND
t2.appln_id IS NULL
AND
t1.appln_id IN (SELECT distinct appln_id from tls209_appln_ipc where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J"))) t3 ON t9.cited_docdb_family_id=t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15

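The problem is that the query, when run in PATSTAT's online web-based interface, does not finish before the session times out. Is there a way to make this query more efficient?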

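tls209_appln_ipc contains 195 million rows with the columns appln_id and ipc_subclass_symbol. An appln_id may appear in this table zero or more times. In my query, I only want a docdb_family_id if any of its linked appln_id values is linked to any of the ipc_subclass_symbol values I listed.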

4 answers:

Answer 0: (score: 1)

This is your query:

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9 LEFT JOIN 
     (SELECT t1.appln_id, t1.docdb_family_id
      from tls201_appln t1 LEFT JOIN
           tls204_appln_prior t2
           on t1.appln_id=t2.appln_id 
      WHERE t1.appln_filing_year BETWEEN 2010 AND 2015 AND
            t2.appln_id IS NULL AND
            t1.appln_id IN (SELECT distinct appln_id
                            from tls209_appln_ipc
                            where ipc_subclass_symbol in ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J"
                                                         )
                           )
           ) t3
      ON t9.cited_docdb_family_id = t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;

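This query has room for optimization. First, subqueries in MySQL should be used sparingly, because subqueries in the FROM clause are materialized. No subquery is needed here; you can chain the left join operations instead. Second, select distinct inside an in subquery is useless. Also, exists is often faster.

I would start by rewriting it as: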

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t1.appln_id
FROM docdb_family_citation t9 LEFT JOIN 
     tls201_appln t1
     on t9.cited_docdb_family_id = t1.docdb_family_id and
        t1.appln_filing_year BETWEEN 2010 AND 2015 and
        exists (select 1 from tls209_appln_ipc t209
                where t209.appln_id = t1.appln_id AND
                      t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
               ) and
        not exists (select 1 from tls204_appln_prior t2
                    where t1.appln_id = t2.appln_id 
                   )
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;

For this query, you want the following indexes: tls204_appln_prior(appln_id), tls209_appln_ipc(appln_id, ipc_subclass_symbol), and tls201_appln(docdb_family_id, appln_id).
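The index list above can be written out as DDL. This is a minimal sketch that assumes a local MySQL copy of PATSTAT where you are allowed to create indexes (the hosted web interface may well not permit it). The index names are made up for illustration, and the third index uses docdb_family_id, which is the tls201_appln column the query joins on.

```sql
-- Hypothetical index names; table and column names are taken from the query.

-- Supports the NOT EXISTS probe on tls204_appln_prior.
CREATE INDEX idx_appln_prior_appln
    ON tls204_appln_prior (appln_id);

-- Supports the EXISTS probe: look up by appln_id, then check the IPC subclass
-- from the index alone, without reading the table rows.
CREATE INDEX idx_appln_ipc_appln_symbol
    ON tls209_appln_ipc (appln_id, ipc_subclass_symbol);

-- Supports the join from cited family to its applications.
CREATE INDEX idx_appln_family_appln
    ON tls201_appln (docdb_family_id, appln_id);
```

Building indexes on tables of this size can take a long time, so it is probably worth running EXPLAIN on the query first to see which lookups are actually unindexed.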

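I'm not a fan of exists and not exists in the on clause, but that seems to be the semantics you are looking for. I strongly suspect there is a better way to write the query, but your question does not provide enough information. A better approach would be to aggregate the t1 table first and then left join the aggregate to the t9 table. However, the nested left join and exists make the intent confusing.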

Answer 1: (score: 0)

I assume you have already created the necessary indexes, so I will skip the indexing part.

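  • Using views for the subquery or the main query and refreshing them in the background is one option. This may help with the timeout problem, because you would select from the view while a background process runs your slow query.
  • Another option is range partitioning on appln_filing_year, and possibly list partitioning on ipc_subclass_symbol. The year will not be a problem, but for ipc_subclass_symbol I don't know how many distinct values you have, so check MySQL's partitioning limits. In your case, partitioning should return results faster than usual.
  • You can increase MySQL's wait_timeout in my.cnf or at run time. If you have not changed it, the default is 28800. Personally, though, I don't like this option.

I hope this helps.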
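The first and third suggestions can be sketched as follows. This is only an illustration under stated assumptions: MySQL views are not stored, so the "refreshed in the background" idea is shown here as an ordinary summary table, and family_citation_counts is a made-up name. It also assumes you have write access, which is unlikely on a hosted interface.

```sql
-- Hypothetical summary table holding the citation count per cited family.
-- A scheduled job would rebuild it; interactive queries then read the small
-- precomputed table instead of aggregating the full citation table.
CREATE TABLE family_citation_counts AS
SELECT cited_docdb_family_id,
       COUNT(cited_docdb_family_id) AS cited
FROM docdb_family_citation
GROUP BY cited_docdb_family_id;

-- Inspect the current timeout, then raise it for this session only.
SHOW VARIABLES LIKE 'wait_timeout';
SET SESSION wait_timeout = 57600;  -- example value, twice the default of 28800
```

As far as I know, wait_timeout controls how long an idle connection is kept open, so it may not be the setting that ends a long-running query in a web front end.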

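Answer 2: (score: 0)

I would be tempted to start by removing the inner subquery. It can be done as a JOIN within the main subquery, using DISTINCT to remove the duplicates that would otherwise be created: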

SELECT t9.cited_docdb_family_id, COUNT(t9.cited_docdb_family_id) AS cited, t3.appln_id
FROM docdb_family_citation t9 
LEFT JOIN 
(
    SELECT DISTINCT t1.appln_id, t1.docdb_family_id 
    FROM tls201_appln t1
    INNER JOIN tls209_appln_ipc t99 ON t1.appln_id = t99.appln_id 
    LEFT JOIN tls204_appln_prior t2 ON t1.appln_id = t2.appln_id 
    WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
    AND t2.appln_id IS NULL
    AND t99.ipc_subclass_symbol IN ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
) t3 
ON t9.cited_docdb_family_id = t3.docdb_family_id
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15

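If the combination of t1.appln_id and t1.docdb_family_id can repeat across multiple rows of tls201_appln, then I would suggest also returning the row's unique key (so that DISTINCT returns distinct rows rather than distinct values).

Answer 3: (score: 0)

With the help of the previous answers, this is the final code that gave the results I was looking for: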

SELECT t9.cited_docdb_family_id, t99.cited AS cited, t1.appln_id, t1.appln_nr_epodoc
        FROM docdb_family_citation t9 
INNER JOIN (SELECT cited_docdb_family_id, count(cited_docdb_family_id) as cited FROM docdb_family_citation GROUP BY cited_docdb_family_id) t99 
ON t9.cited_docdb_family_id = t99.cited_docdb_family_id
LEFT JOIN 
     tls201_appln t1
     on t9.cited_docdb_family_id = t1.docdb_family_id 
     WHERE
        t1.appln_filing_year BETWEEN 2010 AND 2015 and
        exists (select 1 from tls209_appln_ipc t209
                where t209.appln_id = t1.appln_id
                  and    t209.ipc_subclass_symbol in ("A61K", "C07K", "A61P", "C12N", "C07D", "C12P", "C07H", "C12Q", "C07J")
               ) and
        not exists (select 1 from tls204_appln_prior t2
                    where t1.appln_id = t2.appln_id 
                   )
GROUP BY t9.cited_docdb_family_id
ORDER BY cited DESC
LIMIT 15;

Note that the join with the subquery t99 is used to get the correct cited count.
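The final query joins docdb_family_citation to its own aggregate and then groups again. Because the WHERE clause filters on t1 columns, the LEFT JOIN effectively behaves like an inner join. One possible simplification, offered as an untested sketch and not as the author's method, starts from the aggregate and joins it directly to the qualifying applications. MIN(t1.appln_id) is my own assumption for choosing one representative application per family, and appln_nr_epodoc is left out for brevity.

```sql
SELECT c.cited_docdb_family_id,
       c.cited,
       MIN(t1.appln_id) AS appln_id          -- assumed: one representative application per family
FROM (SELECT cited_docdb_family_id,
             COUNT(cited_docdb_family_id) AS cited
      FROM docdb_family_citation
      GROUP BY cited_docdb_family_id) c      -- one row per cited family, with its citation count
JOIN tls201_appln t1
  ON t1.docdb_family_id = c.cited_docdb_family_id
WHERE t1.appln_filing_year BETWEEN 2010 AND 2015
  AND EXISTS (SELECT 1
              FROM tls209_appln_ipc t209
              WHERE t209.appln_id = t1.appln_id
                -- C12N and C12P: assumed intended spelling of the IPC subclasses
                AND t209.ipc_subclass_symbol IN ('A61K', 'C07K', 'A61P', 'C12N', 'C07D',
                                                 'C12P', 'C07H', 'C12Q', 'C07J'))
  AND NOT EXISTS (SELECT 1
                  FROM tls204_appln_prior t2
                  WHERE t2.appln_id = t1.appln_id)
GROUP BY c.cited_docdb_family_id, c.cited
ORDER BY c.cited DESC
LIMIT 15;
```

The count is computed once per family before any join, so duplicate applications within a family cannot inflate it.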