Accessing Spark Dataframe fields with Map Type

Date: 2017-07-25 21:40:32

Tags: scala apache-spark apache-spark-sql spark-dataframe

My current schema is as follows:


I want to first check whether any element of product is a key in Items, and then look at the _2 field of that entry's value to see whether it is smaller than a given rank.

root
|-- product: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- Items: map (nullable = true)
|    |-- key: string
|    |-- value: struct (valueContainsNull = true)
|    |    |-- _1: string (nullable = true)
|    |    |-- _2: long (nullable = false)

My code is as follows:

def has(product: Seq[String], items: Map[String, (String, Long, Long)]): Double = {
  var count = 0
  for (x <- asin) {
    if (items.contains(x)) {
      val item = items.get(x)
      val iitem = item.get
      val (a, b, c) = iitem
      if (b <= rank) {
        count = count + 1
      }
    }
  }
  return count.toDouble
}

def hasId = udf((product: Seq[String], items: Map[String, (String, Long, Long)]) =>
  has(product, items) / items.size.toDouble
)

for (rank <- 0 to 47) {
  joined = joined.withColumn("hasId" + rank, hasId(col("product"), col("items")))
}

The error seems to be related to

GenericRowWithSchema cannot be cast to scala.Tuple3

but I can't figure out what I'm doing wrong.

1 Answer:

Answer 0 (score: 2):

When MapType / ArrayType columns are passed as UDF inputs, their tuple values/keys are actually passed as org.apache.spark.sql.Rows. You have to modify the UDF to expect Map[String, Row] as its second argument, and "convert" those Row values into tuples using pattern matching:

def hasId = udf((product: Seq[String], items: Map[String, Row]) =>
  has(product, items.mapValues {
    case Row(s: String, i1: Long, i2: Long) => (s, i1, i2)
  }) / items.size.toDouble
)
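
If positional matching feels brittle, the same conversion can also be sketched with Row.getAs by field name. This assumes the value struct really does carry three fields named _1, _2 and _3 (matching the tuple type used above), and hasIdByName is just an illustrative name:

def hasIdByName = udf((product: Seq[String], items: Map[String, Row]) =>
  has(product, items.mapValues { r =>
    // Read the struct fields by name instead of by position
    (r.getAs[String]("_1"), r.getAs[Long]("_2"), r.getAs[Long]("_3"))
  }) / items.size.toDouble
)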

Note: somewhat unrelated to the question, it looks like there are a few other bugs in the code - I think rank should probably be passed as a parameter into has? Everything can also be made more idiomatic by removing the use of mutable vars. Altogether, I'm guessing this is what you need:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

def has(products: Seq[String], items: Map[String, (String, Long, Long)], rank: Long): Double = products
  .flatMap(items.get)
  .map(_._2)
  .count(_ <= rank)
  .toDouble

def hasId(rank: Long) = udf((product: Seq[String], items: Map[String, Row]) => {
  val convertedItems = items.mapValues {
    case Row(s: String, i1: Long, i2: Long) => (s, i1, i2)
  }
  has(product, convertedItems, rank) / items.size.toDouble
})

val result = (0 to 47).foldLeft(joined) {
  (df, rank) => df.withColumn("hasId" + rank, hasId(rank)(col("product"), col("items")))
}
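
For a quick sanity check, here is a minimal, self-contained sketch of how the rewritten UDF could be exercised. The SparkSession setup and the sample rows are made up for illustration; the map values are 3-tuples, which Spark encodes as a struct and hands to the UDF as Row:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.master("local[*]").appName("hasId-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data: 3-tuple values become struct<_1: string, _2: bigint, _3: bigint>
val joined = Seq(
  (Seq("a", "b"), Map("a" -> ("first", 1L, 5L), "c" -> ("other", 10L, 2L)))
).toDF("product", "items")

val result = (0 to 47).foldLeft(joined) {
  (df, rank) => df.withColumn("hasId" + rank, hasId(rank)(col("product"), col("items")))
}

result.select("hasId0", "hasId1", "hasId47").show()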