Optimizing an aggregate query against Wikidata

Time: 2019-05-03 10:38:49

Tags: sparql wikidata

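I am running an aggregate query against Wikidata. The query tries to compute the average duration of films, grouped by their genre and year of publication.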

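The multiple groupings/subqueries in the query are meant to preserve an n-1 relationship between films and the grouping criteria (year and genre), and a 1-1 relationship between each film and its duration. The reason is to keep the aggregation correct (OLAP and data warehouse practitioners will be familiar with n-1 relationships).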
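For contrast, here is a minimal sketch of the naive version without the per-film subqueries; it is shown only to illustrate the problem and is not the query being asked about. The `wd:`/`wdt:` prefixes are shorthand for the full IRIs used in the actual query. In this form, a film with two publication dates in the same year contributes its duration twice to that group, a film with several durations is weighted more heavily than other films, and films lacking a genre or a date drop out entirely.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Naive version, for contrast only: one row per
# (film, duration, genre, date) combination feeds the average.
SELECT ?genre ?year (AVG(?duration) AS ?avg)
WHERE {
  ?film wdt:P31   wd:Q11424 ;   # instance of: film
        wdt:P2047 ?duration ;   # duration
        wdt:P136  ?genre ;      # genre
        wdt:P577  ?date .       # publication date
  BIND(YEAR(?date) AS ?year)
}
GROUP BY ?genre ?year
```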

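More explanation is embedded in the query. For these reasons, I cannot remove the grouping done in the subqueries, nor the IF statements or the group concatenation. The query times out on the Wikidata SPARQL endpoint.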

Question

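I need some advice on improving performance. Any optimization tips? If that is not possible, does anyone know of an authenticated way to query Wikidata (so that they know I am not playing around) that allows a longer timeout, or a way to increase the timeout in general?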

    # Average duration of films, grouped by their genre and the year of publication
    SELECT
        ?genre1                    # film genre
        ?year1                     # film year of publication
        (AVG(?duration1) AS ?avg)   # film average duration

    WHERE
        {      
            # Calculate the average duration for each single film.
            # As there are films with multiple durations, these durations are
            # averaged by grouping by film.
            # Hence, a single duration for each film is projected out from the subquery.
            {
              select ?film (avg(?duration) as ?duration1)  
              where{
                ?film   <http://www.wikidata.org/prop/direct/P2047>   ?duration .    
              }group by ?film
            }

            # Here the grouping criteria (genre and year) are calculated.
            # The criteria are grouped by film, so that if multiple
            # genres/years exist for a single film, all of them are
            # concatenated (GROUP_CONCAT) into a single value.
            # Also, if a specific film lacks a year or genre value,
            # a dummy value "OtherYear"/"OtherGenre" is generated.
            {
              select ?film (
                                IF
                                (
                                    group_concat(distinct ?year ; separator="-- ") != "", 
                                    # If multiple years exist for a single film, all of them are concatenated into a single value.
                                    group_concat(distinct ?year ; separator="-- "), 
                                    # If a film has no year value, the dummy value "OtherYear" is generated.
                                    "OtherYear"                                        
                                )
                                as ?year1
                              )
                                (
                                IF
                                (
                                    group_concat(distinct ?genre ; separator="-- ") != "",
                                    # If multiple genres exist for a single film, all of them are concatenated into a single value.
                                    group_concat(distinct ?genre ; separator="-- "), 
                                    # If a film has no genre value, the dummy value "OtherGenre" is generated.
                                    "OtherGenre"  
                                )
                                as ?genre1
                              ) 

              where 
              {
                ?film  <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q11424> .
                 optional {
                   ?film   <http://www.wikidata.org/prop/direct/P577>  ?date .
                   BIND(year(?date) AS ?year)
                 }
                 optional {
                   ?film <http://www.wikidata.org/prop/direct/P136>  ?genre .
                 }
              } group by ?film              
          }

        } GROUP BY ?year1 ?genre1

1 answer:

Answer 0 (score: 1)

The query seems to work after replacing the two IF expressions with a simple sample (which picks an arbitrary value from the group):

    (sample(?year) as ?year1)
    (sample(?genre) as ?genre1) 

So it seems the cost of group_concat is the main problem. I do not find that very intuitive, and I have no explanation for it.

Perhaps the version with sample is good enough, or at least it gives you a baseline for further improvement.
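Put together, the full query with that substitution might look like the sketch below. It has not been run against the endpoint here, so treat it as a starting point rather than a verified fix. Two things are assumptions and not part of the suggestion above: the `wd:`/`wdt:` prefixes (shorthand for the full IRIs in the question) and the COALESCE wrapper, which is added only to keep the "OtherYear"/"OtherGenre" fallback from the original query.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Average film duration per genre and publication year,
# using SAMPLE instead of GROUP_CONCAT for the grouping keys.
SELECT ?genre1 ?year1 (AVG(?duration1) AS ?avg)
WHERE {
  # One duration per film: average the film's P2047 values.
  {
    SELECT ?film (AVG(?duration) AS ?duration1)
    WHERE { ?film wdt:P2047 ?duration . }
    GROUP BY ?film
  }
  # One (year, genre) pair per film: SAMPLE picks an arbitrary value.
  # COALESCE (an assumption, not in the answer) supplies the dummy
  # value when the film has no year or no genre.
  {
    SELECT ?film
           (COALESCE(SAMPLE(?year), "OtherYear") AS ?year1)
           (COALESCE(SAMPLE(?genre), "OtherGenre") AS ?genre1)
    WHERE {
      ?film wdt:P31 wd:Q11424 .
      OPTIONAL { ?film wdt:P577 ?date . BIND(YEAR(?date) AS ?year) }
      OPTIONAL { ?film wdt:P136 ?genre . }
    }
    GROUP BY ?film
  }
}
GROUP BY ?year1 ?genre1
```

Note that this changes the semantics: a film with several genres or years is counted under one arbitrarily chosen value instead of under a combined key, and ?genre1 is an entity IRI rather than a concatenated string.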