How to filter a map<string, int> in a dataframe: Spark / Scala

Asked: 2016-10-18 03:08:25

Tags: scala dictionary apache-spark dataframe filter

I am trying to get counts of individual columns in order to publish metrics. I have a df [customerId: string, totalRent: bigint, totalPurchase: bigint, itemTypeCounts: map<string, int>]

Right now I am doing:

val totalCustomers = df.count

val totalPurchaseCount = df.filter("totalPurchase > 0").count

val totalRentCount = df.filter("totalRent > 0").count


publishMetrics("Total Customer",  totalCustomers )
publishMetrics("Total Purchase",  totalPurchaseCount )
publishMetrics("Total Rent",  totalRentCount )

publishMetrics("Percentage of Rent",  percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase",  percentage(totalPurchaseCount, totalCustomers) )

private def percentage(num: Long, denom: Long): Double = {
  val numD: Double = num.toDouble
  val denomD: Double = denom.toDouble
  if (denomD == 0.0) 0.0
  else (numD / denomD) * 100
}

But I am not sure how to do this for itemTypeCounts, which is a map. I want the count and percentage for each key type. The issue is that the key values are dynamic, i.e. I have no way of knowing the keys in advance. Can someone tell me how to get the count for each key value? I am new to scala/spark, and any other efficient approach to get the count of each column is most welcome.

Sample data:

customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}

customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}

customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}

So the output would be:

totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths  : 1
itemTypeCounts_Blender  : 1

3 Answers:

Answer 0 (score: 1)

You can do this in Spark SQL, both when the keys are known and can be enumerated in code and when they are unknown. Note that by using Spark SQL you take advantage of the Catalyst optimizer, so this will run very efficiently:

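A minimal sketch of both cases (written against the DataFrame API in Spark 2.x; publishMetrics, percentage and totalCustomers are the question's own definitions, while the itemType/customers aliases are just illustrative):

import org.apache.spark.sql.functions._

// Known keys: count the customers whose map contains a given key, e.g. "TV".
val tvCount = df.filter("itemTypeCounts['TV'] is not null").count
publishMetrics("itemTypeCounts_TV", tvCount)

// Unknown keys: explode the map into (itemType, cnt) rows, then count how many
// customers carry each key, mirroring the "totalPurchase > 0" style counts above.
val keyCounts = df
  .selectExpr("customerId", "explode(itemTypeCounts) as (itemType, cnt)")
  .groupBy("itemType")
  .agg(countDistinct("customerId").as("customers"))

keyCounts.collect().foreach { row =>
  val key = row.getString(0)
  val customers = row.getLong(1)
  publishMetrics("itemTypeCounts_" + key, customers)
  publishMetrics("Percentage of " + key, percentage(customers, totalCustomers))
}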

Answer 1 (score: 0)

I'm a spark newbie myself, so there may well be a better way to do this. But one thing you can try is transforming itemTypeCounts into a scala data structure that you can work with. I converted each row into a list of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).

With this you get a list of such pair lists, one pair list per row. In your example this is a list with 3 elements:

List(
  List((Blender,2), (TV,4)), 
  List((Cloths,4)), 
  List((TV,4))
) 

Once you have this structure, converting it to your desired output is standard scala.

A working example follows:

val itemTypeCounts = df.select("itemTypeCounts")

//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
  row =>
    val values = row.getStruct(0).mkString("",",","").split(",")
    val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
    fields.zip(values).filter(p => p._2 != "null")
}.toList

// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
  case Nil => summary
  case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}

//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
  val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
  val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
  updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}

val summaryMap = itemTypeCountsSummary(itemsList, Map())

summaryMap.foreach(e => println(e._1 + ": " + e._2 ))

Output:

itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
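Note that getStruct(0) above assumes Spark inferred itemTypeCounts as a struct column; if it is a genuine MapType column (map<string, int>, as declared in the question), a sketch of the equivalent extraction, under that assumption, would be:

// Assumes itemTypeCounts is a MapType column (map<string, int>); values are
// stringified so the summary functions above can be reused unchanged.
val itemsList = itemTypeCounts.collect().map { row =>
  row.getMap[String, Int](0).toList.map { case (k, v) => (k, v.toString) }
}.toList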

Answer 2 (score: 0)

Borrowing the input from Nick and using the spark-sql pivot solution:

val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")

+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts         |
+----------+-------------+---------+-----------------------+
|1         |17           |0        |[TV -> 4, Blender -> 2]|
|2         |1            |1        |[Cloths -> 4]          |
|3         |0            |10       |[TV -> 4]              |
+----------+-------------+---------+-----------------------+

Assuming we know the distinct itemTypes in advance, we can use

val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3 
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') ) 
""")
dfr.show(false)

+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2  |1      |1     |
+---+-------+------+

To rename the columns,

dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)

+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2                |1                     |1                    |
+-----------------+----------------------+---------------------+

To get the distinct itemTypes dynamically and pass them to pivot:

val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first

item_count_arr: Array[String] = Array(TV, Blender, Cloths)

spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3 
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) ) 
""").show(false)

+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2  |1      |1     |
+---+-------+------+
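To get the itemTypeCounts_ prefix on the dynamic result as well, the same alias rename shown earlier can be applied once the pivot query is assigned to a val (dfr2 is just an illustrative name):

// Same dynamic pivot query as above, assigned instead of shown directly
val dfr2 = spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3 
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) ) 
""")

dfr2.select(dfr2.columns.map(x => col(x).alias("itemTypeCounts_" + x)): _*).show(false)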