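I am querying a Druid database with pydruid and want to compute a post-aggregation over the rows where one aggregation is true and another is false.
I have been able to compute this post-aggregation by posting a JSON query to the Druid database with curl.
With pydruid, I have been able to compute the initial aggregations and the post-aggregation for the intersection of the two aggregation groups. I have tried to find a way to do the same with the ThetaSketchOp class, but without success so far.
Here is my attempt so far using the ThetaSketchOp class in pydruid: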
result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketchOp(
                fn='not',
                fields=[
                    postaggregator.ThetaSketch('apple'),
                    postaggregator.ThetaSketch('orange')
                ],
                name='testing'
            )
        )
    }
)
Here is the query in JSON format, which produces the desired result when sent to the Druid database:
{
  "queryType": "groupBy",
  "dataSource": "datasource",
  "granularity": "ALL",
  "dimensions": [],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "fruit",
        "value": "apple"
      },
      "aggregator": {
        "type": "thetaSketch", "name": "apple", "fieldName": "person"
      }
    },
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "fruit",
        "value": "orange"
      },
      "aggregator": {
        "type": "thetaSketch", "name": "orange", "fieldName": "person"
      }
    }
  ],
  "postAggregations": [
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_orange",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "INTERSECT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    },
    {
      "type": "thetaSketchEstimate",
      "name": "apple_&_not_orange",
      "field": {
        "type": "thetaSketchSetOp",
        "name": "final_unique_users_sketch",
        "func": "NOT",
        "fields": [
          {
            "type": "fieldAccess",
            "fieldName": "apple"
          },
          {
            "type": "fieldAccess",
            "fieldName": "orange"
          }
        ]
      }
    }
  ],
  "intervals": [ "2018-06-30T23:00:05.000Z/2019-07-01T17:00:05.000Z" ]
}
Thanks for reading. If you need any further information, please let me know.
Answer 0 (score: 0)
If you use the != operator to create the NOT theta sketch op, it seems to work:
from pydruid.utils import aggregators, filters, postaggregator

# `query` is the pydruid client object from the question.
result = query.groupby(
    datasource='datasource',
    granularity='all',
    intervals='2018-06-30/2018-08-30',
    filter=(
        (filters.Dimension('fruit') == 'apple') |
        (filters.Dimension('fruit') == 'orange')
    ),
    aggregations={
        'apple': aggregators.filtered(
            filters.Dimension('fruit') == 'apple',
            aggregators.thetasketch('person')),
        'orange': aggregators.filtered(
            (filters.Dimension('fruit') == 'orange'),
            aggregators.thetasketch('person')),
    },
    post_aggregations={
        # `&` builds the INTERSECT set operation.
        'apple_&_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') &
            postaggregator.ThetaSketch('orange')
        ),
        # `!=` builds the NOT set operation (apple minus orange).
        'apple_&_not_orange': postaggregator.ThetaSketchEstimate(
            postaggregator.ThetaSketch('apple') !=
            postaggregator.ThetaSketch('orange')
        )
    }
)
(I found this by studying the pydruid source code.)
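To illustrate why an overloaded operator can stand in for an explicit set-operation object, here is a minimal, self-contained sketch of the pattern. It is not pydruid's actual code: `SketchExpr`, `field_access`, and `estimate` are hypothetical names made up for this example, and the generated `name` values are an assumption. It only shows how `!=` could be mapped to a `thetaSketchSetOp` with `"func": "NOT"`, with the same shape as the working JSON query in the question.

```python
import json


class SketchExpr:
    """Hypothetical stand-in for a theta sketch post-aggregator expression."""

    def __init__(self, spec, name):
        self.spec = spec  # JSON-serializable dict that would be sent to Druid
        self.name = name

    def _set_op(self, func, other, joiner):
        # Combine two expressions into one thetaSketchSetOp spec.
        name = f"{self.name}_{joiner}_{other.name}"
        spec = {
            "type": "thetaSketchSetOp",
            "name": name,
            "func": func,
            "fields": [self.spec, other.spec],
        }
        return SketchExpr(spec, name)

    def __and__(self, other):  # a & b
        return self._set_op("INTERSECT", other, "AND")

    def __or__(self, other):  # a | b
        return self._set_op("UNION", other, "OR")

    def __ne__(self, other):  # a != b, read as "a NOT b" (set difference)
        return self._set_op("NOT", other, "NOT")


def field_access(name):
    """Reference an aggregation result by name."""
    return SketchExpr({"type": "fieldAccess", "fieldName": name}, name)


def estimate(name, expr):
    """Wrap a sketch expression in a thetaSketchEstimate post-aggregation."""
    return {"type": "thetaSketchEstimate", "name": name, "field": expr.spec}


apple_not_orange = estimate(
    "apple_&_not_orange",
    field_access("apple") != field_access("orange"),
)
print(json.dumps(apple_not_orange, indent=2))
```

Operand order matters for NOT: the first field is the set that the second is subtracted from, so `apple != orange` and `orange != apple` describe different results.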