How to perform arbitrary intersections and unions with Algebird's HyperLogLogMonoid

Date: 2016-11-07 03:41:35

Tags: scala scalding

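I want to aggregate the values belonging to each category into an HLL data structure, so that I can later perform intersections and unions and compute the cardinality of the results.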

I am able to estimate the cardinality of each group using com.twitter.algebird.HyperLogLogAggregator.

I need help using com.twitter.algebird.HyperLogLogMonoid to store each group as an HLL that can then be used to compute intersections and unions.

import com.twitter.algebird.HyperLogLogAggregator

val lines_parsed = lines.map { line => parseBlueKaiLogEntry(line) }
// (uuid, [category id array])

val lines_parsed_flat = lines_parsed.flatMap { 
  case(uuid, category_list) => category_list.toList.map {
       category_id => (category_id, uuid) 
     }
}
// (category_id, uuid)

// Group by category
val lines_parsed_grped = lines_parsed_flat.groupBy { 
        case (cat_id, uuid) => cat_id 
      }

// Define the HLL aggregator (the uuid is hashed from its UTF-8 bytes)
val hll_uniq = HyperLogLogAggregator.sizeAggregator(bits = 12).composePrepare[(String, String)] { case (cat_id, uuid) => uuid.toString.getBytes("UTF-8") }

// Aggregate using the HLL count
lines_parsed_grped.aggregate(hll_uniq).dump
// (category_id, count) - expected output

Now I am trying to use the HLL monoid:

// I now want to store the values as HLLs, and this is where I'm not sure what to do
// Create the HLL monoid
val hll = new HyperLogLogMonoid(bits = 12)

val lines_grped_hll = lines_parsed_grped.mapValues { case(cat_id:String, uuid:String ) =>  uuid}.values.map {v:String => hll.create(v.getBytes("UTF-8"))}

// Calling dump produces a lot more lines than I expect to see
lines_grped_hll.dump

What am I doing wrong?

1 Answer:

Answer 0: (score: 0)

Use:

val result  =  hll.sum(lines_grped_hll) // or whichever method of hll suits your case

result.dump
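
To make the idea concrete outside of Scalding, here is a minimal sketch on plain Scala collections. It assumes Algebird is on the classpath in a version where `create(Array[Byte])` is available, as in the question. The object name `HllSetOps`, the category names and the generated uuids are made up for illustration. The point is that each uuid becomes its own HLL, and those HLLs have to be merged per category with the monoid; once there is one HLL per category, a union is just `plus`, and an intersection can be estimated with `intersectionSize`, which is an inclusion-exclusion estimate and so is noisier than the union.

```scala
import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object HllSetOps {
  val hll = new HyperLogLogMonoid(bits = 12)

  // Hypothetical (category_id, uuid) pairs standing in for lines_parsed_flat.
  // catA holds users 1..1000, catB holds users 501..1500 (500 in common).
  val pairs: Seq[(String, String)] =
    (1 to 1000).map(i => ("catA", s"user-$i")) ++
      (501 to 1500).map(i => ("catB", s"user-$i"))

  // One HLL per category: build a single-element HLL per uuid,
  // then merge all HLLs of the same category with the monoid.
  val perCategory: Map[String, HLL] =
    pairs.groupBy(_._1).map { case (cat, rows) =>
      cat -> hll.sum(rows.map { case (_, uuid) => hll.create(uuid.getBytes("UTF-8")) })
    }

  val a: HLL = perCategory("catA")
  val b: HLL = perCategory("catB")

  // Union: merging two HLLs yields the sketch of the union of the two sets.
  val unionEstimate: Long = hll.sizeOf(hll.plus(a, b)).estimate

  // Intersection: estimated from the sketches via inclusion-exclusion.
  val intersectionEstimate: Long = hll.intersectionSize(Seq(a, b)).estimate

  def main(args: Array[String]): Unit =
    println(s"union ~ $unionEstimate, intersection ~ $intersectionEstimate")
}
```

In a Scalding job the same merge step would happen per key rather than in a local `Map`, so the output has one HLL per category instead of one per uuid, which is likely why the `dump` in the question prints far more lines than expected.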