How to use the ThetaSketchOp class in pydruid

Time: 2019-07-09 12:57:46

Tags: python druid

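I am using pydruid to query a Druid database, and I want to compute a post-aggregation where one aggregation is true and another is false, that is, the values present in one theta sketch aggregation but absent from another.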

I have been able to compute this post-aggregation by using curl to POST a JSON query to the Druid database.

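With pydruid, I have been able to compute the initial aggregations and the post-aggregation for the intersection of the two aggregation groups. I have tried to find a way to express the NOT case with the ThetaSketchOp class, but so far without success.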

Here is my attempt so far at using the ThetaSketchOp class in pydruid:

result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange') 
    ),    
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
                postaggregator.ThetaSketch('apple') &
                postaggregator.ThetaSketch('orange')                
        ),
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketchOp(
                fn='not', 
                fields=[
                    postaggregator.ThetaSketch('apple'),
                    postaggregator.ThetaSketch('orange')
                ],
                name='testing'
            )
        )
    }
)

Here is the query in JSON form, which produces the desired result when sent to the Druid database:

{
"queryType": "groupBy",
  "dataSource": "datasource",
  "granularity": "ALL",
  "dimensions": [],
  "aggregations": [
    {
      "type" : "filtered",
      "filter" : {
        "type" : "selector",
        "dimension" : "fruit",
        "value" : "apple"
      },
      "aggregator" :     {
        "type": "thetaSketch", "name": "apple", "fieldName": "person"
      }
    },
    {
      "type" : "filtered",
      "filter" : {
        "type" : "selector",
        "dimension" : "fruit",
        "value" : "orange"
      },
      "aggregator" :     {
        "type": "thetaSketch", "name": "orange", "fieldName": "person"
      }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_orange",
      "field":
      {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "INTERSECT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    },
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_not_orange",
      "field":
      {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "NOT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    }
  ],
  "intervals": [ "2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z" ]
}
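
For comparison with the curl approach, here is a minimal sketch of posting such a JSON query from Python using only the standard library. The broker URL is an assumption (a local broker on port 8082), the helper names `build_druid_request` and `run_druid_query` are hypothetical, and the query is trimmed to a single aggregator to keep the example short.

```python
import json
import urllib.request

# Assumption: a Druid broker listening locally; adjust to your deployment.
BROKER_URL = "http://localhost:8082/druid/v2/"


def build_druid_request(query, url=BROKER_URL):
    """Build (but do not send) an HTTP POST request carrying a native Druid query."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_druid_query(query, url=BROKER_URL):
    """Send the query and decode the JSON response (requires a reachable broker)."""
    with urllib.request.urlopen(build_druid_request(query, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Trimmed version of the query above: one filtered theta sketch aggregator.
query = {
    "queryType": "groupBy",
    "dataSource": "datasource",
    "granularity": "ALL",
    "dimensions": [],
    "aggregations": [
        {
            "type": "filtered",
            "filter": {"type": "selector", "dimension": "fruit", "value": "apple"},
            "aggregator": {"type": "thetaSketch", "name": "apple", "fieldName": "person"},
        }
    ],
    "intervals": ["2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z"],
}

request = build_druid_request(query)
```

Calling `run_druid_query(query)` would then send the request; it is left out of the top-level code here because it needs a live broker.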

Thanks for reading. Please let me know if any further information is needed.

1 answer:

Answer 0 (score: 0)

It seems to work if you create the NOT theta sketch op with the != operator:

from pydruid.utils import aggregators, filters, postaggregator

# `query` is assumed to be an existing pydruid client
# (for example, a pydruid.client.PyDruid instance).
result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            filters.Dimension('fruit') == 'orange',
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        # INTERSECT: `person` values present in both sketches.
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        # NOT: `!=` creates the NOT set operation (apple but not orange).
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') !=
            postaggregator.ThetaSketch('orange')
        ),
    }
)

(I found this by studying the pydruid source code.)
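
To illustrate the idea, here is a minimal, self-contained sketch of how operator overloading can turn `!=` into a `thetaSketchSetOp` with `"func": "NOT"`. This is not pydruid's actual source: the `Sketch` class and `estimate` helper are hypothetical stand-ins, and the set-op name `final_unique_users_sketch` is simply copied from the JSON query in the question.

```python
import json


class Sketch:
    """Hypothetical stand-in for a theta sketch expression (not pydruid's class)."""

    def __init__(self, spec):
        # `spec` is the JSON-ready dict describing this sketch expression.
        self.spec = spec

    @classmethod
    def field(cls, name):
        # Refer to an aggregator by name, like "fieldAccess" in the JSON query.
        return cls({"type": "fieldAccess", "fieldName": name})

    def _set_op(self, func, other):
        # Combine two sketch expressions into one thetaSketchSetOp node.
        return Sketch({
            "type": "thetaSketchSetOp",
            "name": "final_unique_users_sketch",
            "func": func,
            "fields": [self.spec, other.spec],
        })

    def __and__(self, other):
        return self._set_op("INTERSECT", other)

    def __or__(self, other):
        return self._set_op("UNION", other)

    def __ne__(self, other):
        # `a != b` is repurposed to mean "in a but not in b".
        return self._set_op("NOT", other)


def estimate(name, sketch):
    # Wrap a sketch expression in a thetaSketchEstimate post-aggregator.
    return {"type": "thetaSketchEstimate", "name": name, "field": sketch.spec}


post_agg = estimate(
    "apple_&_not_orange",
    Sketch.field("apple") != Sketch.field("orange"),
)
print(json.dumps(post_agg, indent=2))
```

The printed structure has the same shape as the `apple_&_not_orange` entry in the JSON query from the question, which is why writing the expression with `!=` can stand in for spelling out the NOT set operation by hand.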