How do I optimize a SPARQL query that returns optional properties?

Time: 2017-04-26 18:43:04

Tags: sparql marklogic semantic-web marklogic-8

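How can I optimize a SPARQL query like the one below?

The purpose of this query is to:

  1. Identify a resource (the country resource with countryCode = "US").
  2. Get the optional properties defined on that resource.

    Unfortunately, the OPTIONAL blocks are being evaluated before the parent block, which causes the query engine to load all of the data for all countries.

    What I want is LEFT OUTER JOIN behavior, but the query engine is not treating it that way.

    What can I do to improve the query's performance?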

    SELECT  *
    WHERE
      { 
        ?type (rdfs:subClassOf)* gj:Country .
        ?this_0  rdf:type        ?type ;
                 gn:countryCode  "US"
        # each of these blocks is executed as a standalone query in the engine
        OPTIONAL
          { ?this_0  gn:countryCode  ?countryCode_1}
        OPTIONAL
          { ?this_0  gn:name  ?name_2}
        OPTIONAL
          { ?this_0 gj:cscId  ?cscId_3} 
      }
    

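    I am using the SPARQL REST endpoint in MarkLogic 8.4.

    Update

    I tried running the query with the optimize=2 option, but it did not give me a significant performance improvement: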

    /v1/graphs/sparql?optimize=2

    Related: How do I specify options in the SPARQL REST endpoint for MarkLogic?

    Update 2:

    Even when I make one of the optional properties required, the query still runs slowly:

    WHERE
      {
            ?type (rdfs:subClassOf)* gj:Country .
            ?this_0  rdf:type        ?type ;
                 gn:countryCode  "US"; gj:cscId ?cscId_3 ;
      }
    

    Do I need to do anything special to index this gj:cscId property?

    Update 3:

    Here is the profile information from Query Console.

    Query profile

    Update 4:

    Here is the diagnostic trace information:

    2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
    2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
    2017-04-27 13:30:17.238 Info:   <triple-value-entries>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="181">
    2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="15">
    2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8739716">
    2017-04-27 13:30:17.238 Info:       <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="8697064">
    2017-04-27 13:30:17.238 Info:       <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="2119305">
    2017-04-27 13:30:17.238 Info:       <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:     <triple-value-entry count="13946907">
    2017-04-27 13:30:17.238 Info:       <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
    2017-04-27 13:30:17.238 Info:       <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
    2017-04-27 13:30:17.238 Info:       <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
    2017-04-27 13:30:17.238 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
    2017-04-27 13:30:17.238 Info:     </triple-value-entry>
    2017-04-27 13:30:17.238 Info:   </triple-value-entries>
    2017-04-27 13:30:17.238 Info: </triple-value-statistics>
    2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
    2017-04-27 13:30:17.239 Info:   initialPlan=SPARQLModule[
    2017-04-27 13:30:17.239 Info:   Prolog[]
    2017-04-27 13:30:17.239 Info:   SPARQLSelect[SPARQLProject[order()
    2017-04-27 13:30:17.239 Info:       GraphNode[Var type 0]
    2017-04-27 13:30:17.239 Info:       GraphNode[Var this_0 1]
    2017-04-27 13:30:17.239 Info:       GraphNode[Var cscId_3 2]
    2017-04-27 13:30:17.239 Info:       SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
    2017-04-27 13:30:17.239 Info:         SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
    2017-04-27 13:30:17.239 Info:           SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
    2017-04-27 13:30:17.239 Info:             SPARQLZeroOrOne[
    2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]
    2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
    2017-04-27 13:30:17.239 Info:               SPARQLScatterOneOrMore[
    2017-04-27 13:30:17.239 Info:                 GraphNode[Var type 0]
    2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON16629111911678922088 0]
    2017-04-27 13:30:17.239 Info:                 GraphNode[Var ANON7634081659815295853 1]
    2017-04-27 13:30:17.239 Info:                 GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
    2017-04-27 13:30:17.239 Info:                 TriplePattern[order(0,1) PSO
    2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON16629111911678922088 0]
    2017-04-27 13:30:17.239 Info:                   GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
    2017-04-27 13:30:17.239 Info:                   GraphNode[Var ANON7634081659815295853 1]]]]
    2017-04-27 13:30:17.239 Info:             TriplePattern[order(0,1) OPS
    2017-04-27 13:30:17.239 Info:               GraphNode[Var this_0 1]
    2017-04-27 13:30:17.239 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
    2017-04-27 13:30:17.239 Info:               GraphNode[Var type 0]]]
    2017-04-27 13:30:17.239 Info:           TriplePattern[order(1) SOP
    2017-04-27 13:30:17.239 Info:             GraphNode[Var this_0 1]
    2017-04-27 13:30:17.239 Info:             GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
    2017-04-27 13:30:17.239 Info:             GraphNode[Literal "US"]]]
    2017-04-27 13:30:17.239 Info:         TriplePattern[order(1,2) PSO
    2017-04-27 13:30:17.239 Info:           GraphNode[Var this_0 1]
    2017-04-27 13:30:17.239 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
    2017-04-27 13:30:17.239 Info:           GraphNode[Var cscId_3 2]]]]]]
    2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
    2017-04-27 13:30:17.239 Info:   initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
    2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
    2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
    2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
    2017-04-27 13:30:17.320 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
    2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
    2017-04-27 13:30:17.326 Info:   cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
    2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
    2017-04-27 13:30:17.326 Info:   bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
    2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
    2017-04-27 13:30:17.326 Info:   plan=SPARQLModule[
    2017-04-27 13:30:17.326 Info:   Prolog[]
    2017-04-27 13:30:17.326 Info:   SPARQLSelect[SPARQLProject[order(1,0)
    2017-04-27 13:30:17.326 Info:       GraphNode[Var type 0]
    2017-04-27 13:30:17.326 Info:       GraphNode[Var this_0 1]
    2017-04-27 13:30:17.326 Info:       GraphNode[Var cscId_3 2]
    2017-04-27 13:30:17.326 Info:       SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
    2017-04-27 13:30:17.326 Info:         TriplePattern[order(1,2) PSO
    2017-04-27 13:30:17.326 Info:           GraphNode[Var this_0 1]
    2017-04-27 13:30:17.326 Info:           GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
    2017-04-27 13:30:17.326 Info:           GraphNode[Var cscId_3 2]]
    2017-04-27 13:30:17.326 Info:         SPARQLHashJoin[order(1,0) hash(0==0) scatter()
    2017-04-27 13:30:17.326 Info:           SPARQLZeroOrOne[
    2017-04-27 13:30:17.326 Info:             GraphNode[Var type 0]
    2017-04-27 13:30:17.326 Info:             GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
    2017-04-27 13:30:17.326 Info:             SPARQLBloomOneOrMore[
    2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
    2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON7634081659815295853 1]
    2017-04-27 13:30:17.326 Info:               GraphNode[Var ANON16629111911678922088 0]
    2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]
    2017-04-27 13:30:17.326 Info:               TriplePattern[order(0,1) PSO
    2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON16629111911678922088 0]
    2017-04-27 13:30:17.326 Info:                 GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
    2017-04-27 13:30:17.326 Info:                 GraphNode[Var ANON7634081659815295853 1]]]]
    2017-04-27 13:30:17.326 Info:           SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
    2017-04-27 13:30:17.326 Info:             TriplePattern[order(1) OPS
    2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
    2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
    2017-04-27 13:30:17.326 Info:               GraphNode[Literal "US"]]
    2017-04-27 13:30:17.326 Info:             TriplePattern[order(1,0) PSO
    2017-04-27 13:30:17.326 Info:               GraphNode[Var this_0 1]
    2017-04-27 13:30:17.326 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
    2017-04-27 13:30:17.326 Info:               GraphNode[Var type 0]]]]]]]]
    

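    Update 5:

    In some use cases, I found that the ?type property path expression can be removed from the query. In one such case, performance improved by two orders of magnitude: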

    WHERE
      { 
        ?this_0  rdf:type        gj:Country ;
                 gn:countryCode  "US"
        # each of these blocks is executed as a standalone query in the engine
        OPTIONAL
          { ?this_0  gn:countryCode  ?countryCode_1}
        OPTIONAL
          { ?this_0  gn:name  ?name_2}
        OPTIONAL
          { ?this_0 gj:cscId  ?cscId_3} 
      }
    

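    Because this solution changes the query's output, it does not work for every use case.

    It seems the problem is not OPTIONAL itself. Rather, the property path expression confuses the query planner, so the properties in the OPTIONAL blocks are looked up independently (which does not perform well).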

2 Answers:

Answer 0: (score: 5)

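The query optimizer relies on statistics to determine the best order of operations. Usually there is a restrictive triple pattern that can be used to limit the subsequent operations through scatter joins.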

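In your case, the statistics do not offer such an obviously restrictive triple pattern. You can see this in the triple value statistics output: the string "US" occurs 2061783 times as an object, so it is not a very restrictive constraint.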
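
To check how selective the country-code pattern is on its own, a count query along these lines can be run directly. This is only a sketch: the gn: prefix IRI is taken from the trace output in the question, and the variable names ?s and ?usSubjects are chosen here for illustration.

```sparql
PREFIX gn: <http://www.geonames.org/ontology#>

# Count the distinct subjects that carry the country code "US".
# A large result means this pattern alone is a poor starting point for the join.
SELECT (COUNT(DISTINCT ?s) AS ?usSubjects)
WHERE {
  ?s gn:countryCode "US" .
}
```

The trace statistics count "US" as an object across three predicates, so the figure returned here may be lower than the 2061783 reported there.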

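The gj:Country IRI is restrictive (179 occurrences in the object position), but unfortunately you need to use it on the right-hand side of a transitive closure operator. It is hard to predict how many results a transitive closure operator will return, because that depends heavily on the actual data.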
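
Although the optimizer cannot predict the size of the closure, you can measure it yourself. The following sketch counts the classes reached by the path from the question; the gj: prefix IRI is copied from the trace output, and ?classCount is a name chosen here for illustration.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gj:   <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#>

# Count gj:Country itself (zero steps) plus every class that reaches it
# through one or more rdfs:subClassOf links.
SELECT (COUNT(DISTINCT ?type) AS ?classCount)
WHERE {
  ?type rdfs:subClassOf* gj:Country .
}
```

The statistics in the question report only 15 rdfs:subClassOf triples in total, so the result should be small, even though the optimizer has no way of knowing that in advance.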

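You may find that using a property path like the one below allows MarkLogic to avoid the zero-or-one operator, which may give a small performance improvement: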

?this_0 a/rdfs:subClassOf* gj:Country .
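
Dropped into the original query, that path might look like the sketch below. The prefix IRIs are taken from the trace output in the question. Note that ?type is no longer bound, so the result columns differ from those of the original query.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gj:   <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#>
PREFIX gn:   <http://www.geonames.org/ontology#>

SELECT *
WHERE {
  # "a" is shorthand for rdf:type; the sequence path replaces the separate ?type pattern.
  ?this_0 a/rdfs:subClassOf* gj:Country ;
          gn:countryCode "US" .
  OPTIONAL { ?this_0 gn:countryCode ?countryCode_1 }
  OPTIONAL { ?this_0 gn:name ?name_2 }
  OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}
```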

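If you know (for example) that there is only one gj:Country with the country code "US", you can add a limit to that part of the query to give the optimizer a hint about how to handle it, i.e.: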

select * {
  {
    select * {
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0  gn:countryCode  'US' .
    } limit 1
  }
  OPTIONAL { ?this_0 gj:cscId  ?cscId_3 } 
}
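
Extended to all three optional properties from the question, the same idea could be sketched as follows. The prefix IRIs come from the trace output, and the LIMIT 1 is only safe under the assumption stated above that exactly one resource matches. Projecting only ?this_0 from the subquery is a choice made here to keep the inner result narrow.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX gj:   <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#>
PREFIX gn:   <http://www.geonames.org/ontology#>

SELECT *
WHERE {
  {
    # Resolve the single matching resource first.
    SELECT ?this_0
    WHERE {
      ?this_0 a/rdfs:subClassOf* gj:Country ;
              gn:countryCode "US" .
    }
    LIMIT 1
  }
  # Each OPTIONAL now joins against the one binding produced above.
  OPTIONAL { ?this_0 gn:countryCode ?countryCode_1 }
  OPTIONAL { ?this_0 gn:name ?name_2 }
  OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}
```

If more than one resource can match, the limit would silently drop results, so it should be removed or raised in that case.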

Answer 1: (score: 0)

MarkLogic 8 seems to have performance problems with property paths that use *. Try replacing

?type (rdfs:subClassOf)* gj:Country .