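I am currently working on a project involving patents pulled from the USPTO website. As part of this project I am using a database created by people at the University of Illinois.
(Paper: http://abel.lis.illinois.edu/UPDC/USPTOPatentsDatabaseConstruction.pdf)
(The table schemas I am using, slightly outdated and only missing non-index/key columns: http://i.imgur.com/44LHS3L.png)
As the title says, I am trying to optimize this query: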
SELECT
PN,
AN,
grants.GrantID,
grants.FileDate,
grants.IssueDate,
grants.Kind,
grants.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
(
SELECT grants.GrantID as CitingID, CitedID as PN, grants2.ApplicationID AS AN
FROM
gracit_g, grants, grants as grants2
Where
grants.GrantID IN (*a couple thousand keys*)
and grants.GrantID = gracit_g.GrantID and grants2.GrantID = CitedID
LIMIT 500000) tbl1,
grants, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
grants.GrantID = tbl1.CitingID
and grants.GrantID = assignee_g.GrantID
and grants.GrantID = inventor_g.GrantID
and grants.GrantID = usclass_g.GrantID
and grants.GrantID = intclass_g.GrantID
GROUP BY PN, GrantID
LIMIT 50000000
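For pretty much every patent cited by one of the patents I have a key for, I want to record the information of the patent that cited it. The problem I appear to be having is that my "GROUP BY PN, GrantID" causes "Using temporary; Using filesort", which severely slows things down.
This is what EXPLAIN gives me (sorry if the formatting is off, I could not find how to make a table):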
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 8716 | Using temporary; Using filesort
1 | PRIMARY | grants | eq_ref | PRIMARY | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | assignee_g | ref | PRIMARY,FK_PublicationID_PUBLICATION_ASSIGNEE_P | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | intclass_g | ref | PRIMARY,fk_publicationid_PUBLICATION_INTERNATIONALCLASS_P | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | inventor_g | ref | PRIMARY,fk_PublicationID_Inventor_p | PRIMARY | 62 | tbl1.CitingID | 1 |
1 | PRIMARY | usclass_g | ref | PRIMARY,fk_publicationid_PUBLICATION_USCLASS_P | PRIMARY | 62 | tbl1.CitingID | 2 |
2 | DERIVED | grants | range | PRIMARY | PRIMARY | 62 | NULL | 2179 | Using where; Using index
2 | DERIVED | gracit_g | ref | PRIMARY,FK_PublicationID_PUBLICATION_PCITATION_P,CitedID | PRIMARY | 62 | uspto_patents.grants.GrantID | 4 | Using where
2 | DERIVED | grants2 | eq_ref | PRIMARY | PRIMARY | 62 | uspto_patents.gracit_g.CitedID | 1 |
The SHOW CREATE TABLE output for gracit_g is:
CREATE TABLE `gracit_g` (
`GrantID` varchar(20) NOT NULL,
`Position` int(11) NOT NULL,
`CitedID` varchar(20) DEFAULT NULL,
`Kind` varchar(10) DEFAULT NULL COMMENT 'identify whether citedDoc is a document or foreign patent',
`Name` varchar(100) DEFAULT NULL,
`Date` date DEFAULT NULL,
`Country` varchar(100) DEFAULT NULL,
`Category` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`,`Position`),
KEY `FK_PublicationID_PUBLICATION_PCITATION_P` (`GrantID`),
KEY `CitedID` (`CitedID`),
CONSTRAINT `FK_GrantID_GRANT_PCITATION_G0` FOREIGN KEY (`GrantID`) REFERENCES `grants` (`GrantID`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8
The SHOW CREATE TABLE output for grants is:
CREATE TABLE `grants` (
`GrantID` varchar(20) NOT NULL,
`Title` varchar(500) DEFAULT NULL,
`IssueDate` date DEFAULT NULL,
`Kind` varchar(2) DEFAULT NULL,
`USSeriesCode` varchar(2) DEFAULT NULL,
`Abstract` text,
`ClaimsNum` int(11) DEFAULT NULL,
`DrawingsNum` int(11) DEFAULT NULL,
`FiguresNum` int(11) DEFAULT NULL,
`ApplicationID` varchar(20) NOT NULL,
`Claims` text,
`FileDate` date DEFAULT NULL,
`AppType` varchar(45) DEFAULT NULL,
`AppNoOrig` varchar(10) DEFAULT NULL,
`SourceName` varchar(100) DEFAULT NULL,
PRIMARY KEY (`GrantID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
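Thank you very much for your time. Unfortunately I must retire to my bed, as it is far too late (or rather far too early) for me to keep working.
One suggestion was to change it to a single query instead of using a subquery: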
SELECT
gracit_g.citedID,
info_grant.GrantID,
info_grant.FileDate,
info_grant.IssueDate,
info_grant.Kind,
info_grant.ApplicationID,
assignee_g.OrgName,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) separator ';') as Assignee,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.FirstName, inventor_g.LastName) separator ';') as Inventor,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', inventor_g.City, inventor_g.State, inventor_g.Country) separator ';') as Inventor_address,
GROUP_CONCAT(DISTINCT CONCAT_WS(', ', usclass_g.Class, usclass_g.Subclass) separator ';') as USClass,
intclass_g.Section,
intclass_g.Class,
intclass_g.Subclass,
intclass_g.MainGroup,
intclass_g.SubGroup
FROM
gracit_g, grants as info_grant, assignee_g, inventor_g, usclass_g, intclass_g
WHERE
gracit_g.GrantID IN (*KEYS*)
and info_grant.GrantID = gracit_g.GrantID
and info_grant.GrantID = assignee_g.GrantID
and info_grant.GrantID = inventor_g.GrantID
and info_grant.GrantID = usclass_g.GrantID
and info_grant.GrantID = intclass_g.GrantID
GROUP BY gracit_g.citedID, info_grant.GrantID
LIMIT 50000000
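This has reduced it from 21s duration / 10s fetch to 13s duration / 8s fetch. I would still like to improve on that, as I have many keys to go through.
Answer 0 (score: 2)
Your query has the form: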
SELECT some_fields
FROM (
SELECT other_fields
FROM table1, table2
WHERE join_condition_table1_table2 AND some_other_condition
) AS subquery, table3
WHERE join_condition_subquery_table3
GROUP BY another_field
You need to rewrite it as follows:
SELECT some_fields
FROM table1, table2, table3
WHERE
join_condition_table1_table2
AND join_condition_subquery_table3 -- actually rewrite this as a join of either table1 and table3, or table2 and table3
AND some_other_condition
GROUP BY another_field
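Applied to the tables in the question, the flattened form might look roughly like the sketch below. It uses explicit JOIN ... ON syntax and keeps the cited patent's ApplicationID (AN in the original query) through a second join on grants. The aliases c, g, cited and i and the two key values are placeholders of mine, and only one of the GROUP_CONCAT columns is shown to keep it short.

```sql
-- Sketch only: aliases and key values are hypothetical placeholders.
SELECT
    c.CitedID           AS PN,   -- the cited patent
    cited.ApplicationID AS AN,   -- application number of the cited patent
    g.GrantID,                   -- the citing patent
    g.FileDate,
    g.IssueDate,
    g.Kind,
    g.ApplicationID,
    GROUP_CONCAT(DISTINCT CONCAT_WS(', ', i.FirstName, i.LastName)
                 SEPARATOR ';') AS Inventor
FROM gracit_g AS c
JOIN grants     AS g     ON g.GrantID     = c.GrantID   -- citing patent
JOIN grants     AS cited ON cited.GrantID = c.CitedID   -- cited patent
JOIN inventor_g AS i     ON i.GrantID     = g.GrantID
WHERE c.GrantID IN ('KEY1', 'KEY2')                     -- your list of keys
GROUP BY c.CitedID, g.GrantID;
```

The other child tables (assignee_g, usclass_g, intclass_g) would be joined on g.GrantID in the same way.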
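As @Ollie Jones pointed out, selecting fields that are neither part of the GROUP BY clause nor wrapped in an aggregate function (in the SELECT clause) is dangerous. If these fields do not uniquely depend on the fields in the GROUP BY clause, their values are undefined.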
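In the question this affects assignee_g.OrgName and the intclass_g columns, since a patent can have several rows in each of those tables. One possible way around it, assuming a concatenated list is acceptable output, is to aggregate them like the other multi-valued columns. The alias IntClass, the '/' separator and the key values below are my own choices, not something from the question.

```sql
-- Sketch only: IntClass, the '/' separator and the keys are assumptions.
SELECT
    c.CitedID,
    g.GrantID,
    GROUP_CONCAT(DISTINCT CONCAT_WS('/', ic.Section, ic.Class, ic.Subclass,
                                         ic.MainGroup, ic.SubGroup)
                 SEPARATOR ';') AS IntClass
FROM gracit_g   AS c
JOIN grants     AS g  ON g.GrantID  = c.GrantID
JOIN intclass_g AS ic ON ic.GrantID = g.GrantID
WHERE c.GrantID IN ('KEY1', 'KEY2')   -- your list of keys
GROUP BY c.CitedID, g.GrantID;
```

Every selected column is now either in the GROUP BY clause or inside an aggregate, so the result no longer depends on which row the server happens to pick.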
[edit]
A few more suggestions:
Add an index on gracit_g(citedID, GrantID), in this order, with ALTER TABLE gracit_g ADD INDEX(citedID, GrantID); and change the GROUP BY clause to GROUP BY gracit_g.citedID, gracit_g.GrantID. The optimizer may then use this index to compute the GROUP BY clause.
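If your VARCHAR primary keys are actually numeric, change their type to a suitable integer type. If they are not, add a numeric surrogate key and use it as the primary key instead. Integer comparisons are much faster, and you are doing a lot of comparisons across all your joins.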
Pre-compute concatenated values such as CONCAT_WS(', ', assignee_g.City, assignee_g.State, assignee_g.Country) in an extra column, or in an extra table (the latter would require one extra join per table).
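Increase the tmp_table_size and max_heap_table_size server options. If a temporary table grows larger than either of these two values (in bytes), it cannot be kept in memory and will be written to disk. You may benefit from unusually large values, since you are dealing with an unusually large result set.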
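The last two suggestions could be tried along these lines. The column name Location, its VARCHAR(255) size and the 256 MB figure are arbitrary placeholders of mine, so adjust them to your data and to the memory available on the server.

```sql
-- Pre-computed concatenation (hypothetical column name and size).
ALTER TABLE assignee_g ADD COLUMN Location VARCHAR(255) NULL;
UPDATE assignee_g
   SET Location = CONCAT_WS(', ', City, State, Country);

-- Raise both limits for the current session only; the smaller of the
-- two is the one that decides when a temporary table spills to disk.
SET SESSION tmp_table_size      = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;
```

The query can then use GROUP_CONCAT(DISTINCT assignee_g.Location SEPARATOR ';') instead of rebuilding the string for every joined row. Setting the variables per session keeps the change away from other connections while you measure whether it helps.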
I don't know whether there is much else to do. You may want to consider returning a smaller result set (fewer columns, more filters, or a smaller LIMIT).