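I am trying to generate some user statistics from a triple store using SPARQL. Please see the query below. How can it be improved? Am I doing something evil? Why does it consume so much memory? (See the background at the end of this post.)
I would prefer to do the aggregation and the joins inside the triple store. Splitting the query would mean joining the results "manually" outside the database, which loses the efficiency and optimizations of the triple store. There is no need to reinvent the wheel.
Query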
SELECT
  ?person
  (COUNT(DISTINCT ?sent_email) AS ?sent_emails)
  (COUNT(DISTINCT ?received_email) AS ?received_emails)
  (COUNT(DISTINCT ?receivedInCC_email) AS ?receivedInCC_emails)
  (COUNT(DISTINCT ?revision) AS ?commits)
WHERE {
  ?person rdf:type foaf:Person.
  OPTIONAL {
    ?sent_email rdf:type email:Email.
    ?sent_email email:sender ?person.
  }
  OPTIONAL {
    ?received_email rdf:type email:Email.
    ?received_email email:recipient ?person.
  }
  OPTIONAL {
    ?receivedInCC_email rdf:type email:Email.
    ?receivedInCC_email email:ccRecipient ?person.
  }
  OPTIONAL {
    ?revision rdf:type vcs:VcsRevision.
    ?revision vcs:committedBy ?person.
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)
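Background
The problem is that I get the error "QUERY MEMORY LIMIT REACHED" in AllegroGraph (see also my related SO question). The repository contains only about 200k triples, which easily fit into an N-Triples input file of approx. 60 MB, so I wonder how executing the query can require more than 4 GB of RAM, which is roughly two orders of magnitude more.
Answer 0 (score: 0)
Try splitting the computation into subqueries, for example: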
SELECT
  ?person
  (MAX(?sent_emails_) AS ?sent_emails)
  (MAX(?received_emails_) AS ?received_emails)
  (MAX(?receivedInCC_emails_) AS ?receivedInCC_emails)
  (MAX(?commits_) AS ?commits)
WHERE {
  {
    SELECT
      ?person
      (COUNT(DISTINCT ?sent_email) AS ?sent_emails_)
      (0 AS ?received_emails_)
      (0 AS ?receivedInCC_emails_)
      (0 AS ?commits_)
    WHERE {
      ?sent_email rdf:type email:Email.
      ?sent_email email:sender ?person.
      ?person rdf:type foaf:Person.
    } GROUP BY ?person
  } UNION {
    (similar pattern for the others)
    ....
  }
}
GROUP BY ?person
ORDER BY DESC(?commits)
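A likely reason why splitting into subqueries helps: in the original query each OPTIONAL block joins on ?person only, so for a single person the matches of the four blocks are combined with each other before COUNT(DISTINCT ...) collapses them again. The Python sketch below only does that row-count arithmetic. The per-person numbers and the function names are invented for illustration, and it is not a model of how AllegroGraph actually evaluates the query.

```python
from math import prod

# Invented per-person match counts, for illustration only.
counts = {
    "alice": {"sent": 200, "received": 300, "cc": 50, "commits": 100},
    "bob": {"sent": 0, "received": 10, "cc": 0, "commits": 3},
}


def rows_with_optionals(per_person):
    """Rows before grouping when all four OPTIONALs share one WHERE clause."""
    # An OPTIONAL with no match still keeps one row (variable unbound),
    # hence max(n, 1); matches of different blocks multiply.
    return sum(
        prod(max(n, 1) for n in blocks.values())
        for blocks in per_person.values()
    )


def rows_with_subqueries(per_person):
    """Rows before grouping when each count has its own subquery."""
    # Each subquery sees only its own matches, so the sizes add up.
    return sum(sum(blocks.values()) for blocks in per_person.values())


print(rows_with_optionals(counts))   # 300000030
print(rows_with_subqueries(counts))  # 663
```

Under these made-up numbers, one active person alone accounts for hundreds of millions of intermediate rows in the single-query form, while the subquery form stays in the hundreds. That would be consistent with a 60 MB dataset exhausting several GB of query memory.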
The goal is: