Question

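I am running an aggregate query against Wikidata. The query tries to compute the average duration of films, grouped by their genre and year of publication.

The multiple groupings/subqueries in the query are meant to preserve an n-1 relationship between a film and the grouping criteria (year and genre), and a 1-1 relationship between a film and its duration. This is needed for the aggregation to be correct (OLAP and data warehouse practitioners will be familiar with n-1 relationships).

More explanation is embedded in the query. Hence, I cannot drop the grouping done in the subqueries, nor the IF statements or the group concatenation. The query times out on the Wikidata SPARQL endpoint.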

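Question

I need some suggestions for improving performance. Any optimization hints? If that is not possible, does anyone know of an authenticated way to query Wikidata (so that they know I am not just playing around) that would raise the timeout, or of a way to increase the timeout in general?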

    # Average duration of films, grouped by their genre and the year of publication       
SELECT  
        ?genre1                    # film genre
        ?year1                     # film year of publication
        (AVG(?duration1) AS ?avg)   # film average duration

WHERE
        {      
            # Calculating the average duration for each single film.
            # As there are films with multiple durations, these durations are
            # averaged by grouping and aggregating durations by film.
            # Hence, a single duration for each film is projected out from the subquery.
            {
              select ?film (avg(?duration) as ?duration1)  
              where{
                ?film   <http://www.wikidata.org/prop/direct/P2047>   ?duration .    
              }group by ?film
            }

            # Here the grouping criteria (genre and year) are calculated.
            # The criteria are grouped by film, so that in case multiple
            # genres/multiple years exist for a single film, all of them are
            # concatenated (GROUP_CONCAT) into a single value.
            # Also, if a specific film lacks a value for year or genre,
            # a dummy value "OtherYear"/"OtherGenre" is generated.
            {
              select ?film (
                                IF
                                (
                                    group_concat(distinct ?year ; separator="-- ") != "", 
                                    # In case multiple years exist for a single film, all of them are concatenated into a single value.
                                    group_concat(distinct ?year ; separator="-- "), 
                                    # If a specific film lacks a value for year, a dummy value "OtherYear" is generated.
                                    "OtherYear"                                        
                                )
                                as ?year1
                              )
                                (
                                IF
                                (
                                    group_concat(distinct ?genre ; separator="-- ") != "",
                                    # In case multiple genres exist for a single film, all of them are concatenated into a single value.
                                    group_concat(distinct ?genre ; separator="-- "), 
                                    # If a specific film lacks a value for genre, a dummy value "OtherGenre" is generated.
                                    "OtherGenre"  
                                )
                                as ?genre1
                              ) 

              where 
              {
                ?film  <http://www.wikidata.org/prop/direct/P31>  <http://www.wikidata.org/entity/Q11424> .
                 optional {
                   ?film   <http://www.wikidata.org/prop/direct/P577>  ?date .
                   BIND(year(?date) AS ?year)
                 }
                 optional {
                   ?film <http://www.wikidata.org/prop/direct/P136>  ?genre .
                 }
              } group by ?film              
          }

        } GROUP BY ?year1 ?genre1

Answer 1

The query seems to work after replacing the two IF expressions with a simple sample (which picks one arbitrary value from the group):

    (sample(?year) as ?year1)
    (sample(?genre) as ?genre1)

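So it looks like the cost of group_concat is the main problem. I don't find that very intuitive, and I have no explanation for it.

Maybe the version with sample is good enough, or at least it gives you a baseline from which to improve further.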
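
For reference, here is a sketch of how the whole query might look with that substitution applied. The `PREFIX` declarations and the compact layout are my own shorthand for the full IRIs used in the question; nothing else is meant to differ. Note that this is not equivalent to the original: a film with several genres or years is counted under one arbitrary value, and films without a year or genre end up in a group with an unbound key instead of "OtherYear"/"OtherGenre".

    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    # Average duration of films, grouped by genre and year of publication.
    SELECT ?genre1 ?year1 (AVG(?duration1) AS ?avg)
    WHERE {
      # One duration per film: the average of all its P2047 values.
      {
        SELECT ?film (AVG(?duration) AS ?duration1)
        WHERE { ?film wdt:P2047 ?duration . }
        GROUP BY ?film
      }
      # One arbitrary year and one arbitrary genre per film, picked with SAMPLE
      # instead of IF(GROUP_CONCAT(...)).
      {
        SELECT ?film (SAMPLE(?year) AS ?year1) (SAMPLE(?genre) AS ?genre1)
        WHERE {
          ?film wdt:P31 wd:Q11424 .
          OPTIONAL { ?film wdt:P577 ?date . BIND(YEAR(?date) AS ?year) }
          OPTIONAL { ?film wdt:P136 ?genre . }
        }
        GROUP BY ?film
      }
    }
    GROUP BY ?year1 ?genre1

If the dummy values matter, something like `(COALESCE(SAMPLE(?year), "OtherYear") AS ?year1)` should bring them back, but I have not checked whether that variant still finishes within the timeout.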
