My current schema is as follows:
root
|-- product: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Items: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: long (nullable = false)
I first check whether any element of product is a key in Items, and then check the _2 field of that entry's value to see whether it is below a given value. My code is as follows:
def has(product: Seq[String], items: Map[String, (String, Long, Long)]): Double = {
  var count = 0
  for (x <- asin) {
    if (items.contains(x)) {
      val item = items.get(x)
      val iitem = item.get
      val (a, b, c) = iitem
      if (b <= rank) {
        count = count + 1
      }
    }
  }
  return count.toDouble
}
def hasId = udf((product: Seq[String], items: Map[String, (String, Long, Long)]) =>
  has(product, items) / items.size.toDouble
)

for (rank <- 0 to 47) {
  joined = joined.withColumn("hasId" + rank, hasId(col("product"), col("items")))
}
The error seems to be related to:

GenericRowWithSchema cannot be cast to scala.Tuple3

but I can't figure out what I am doing wrong.
Answer 0 (score: 2):
When you pass a MapType or ArrayType column as a UDF input, tuple values/keys are actually passed as org.apache.spark.sql.Rows. You have to modify your UDF to expect a Map[String, Row] as its second argument, and "convert" those Row values into tuples using pattern matching:
def hasId = udf((product: Seq[String], items: Map[String, Row]) =>
  has(product, items.mapValues {
    case Row(s: String, i1: Long, i2: Long) => (s, i1, i2)
  }) / items.size.toDouble
)
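To see the fix in isolation, here is a minimal self-contained sketch. The data, column names, and the countBelow UDF are hypothetical, invented purely for illustration, and a SparkSession named spark is assumed to be in scope. Tuples written into a DataFrame are encoded as structs, so the UDF has to pattern match on Row:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession `spark` is in scope

// Hypothetical data: the (String, Long, Long) tuples become structs in the DataFrame.
val df = Seq(
  (Seq("a", "b", "c"), Map("a" -> ("x", 1L, 2L), "b" -> ("y", 5L, 6L)))
).toDF("product", "items")

// The map values arrive as Rows, not Tuple3s; match on Row and read the fields.
val countBelow = udf((product: Seq[String], items: Map[String, Row]) =>
  product.count { p =>
    items.get(p).exists { case Row(_: String, i1: Long, _: Long) => i1 <= 3L }
  }
)

// belowCount is 1 for this row: only "a" is present with a second field <= 3.
df.withColumn("belowCount", countBelow(col("product"), col("items"))).show()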
Note, somewhat unrelated to the question: the code seems to have a few other bugs. I assume rank should be passed into has as a parameter? And everything can be made more idiomatic by removing the use of mutable vars. Altogether, I'm guessing this is what you need:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

// Count how many products appear in items with a _2 value at or below rank.
def has(products: Seq[String], items: Map[String, (String, Long, Long)], rank: Long): Double = products
  .flatMap(items.get)
  .map(_._2)
  .count(_ <= rank)
  .toDouble

def hasId(rank: Long) = udf((product: Seq[String], items: Map[String, Row]) => {
  // Struct values arrive as Rows; pattern match to rebuild the tuples.
  val convertedItems = items.mapValues {
    case Row(s: String, i1: Long, i2: Long) => (s, i1, i2)
  }
  has(product, convertedItems, rank) / items.size.toDouble
})

// Thread the DataFrame through withColumn once per rank, with no mutable var.
val result = (0 to 47).foldLeft(joined) {
  (df, rank) => df.withColumn("hasId" + rank, hasId(rank)(col("product"), col("items")))
}
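One note on the design: the foldLeft threads the DataFrame through each withColumn call, replacing the var joined reassignment from the question with an immutable pipeline. A quick illustrative check (hedged: joined and its contents are whatever your own pipeline produces):

// result now has hasId0 through hasId47 in addition to joined's columns.
result.select("hasId0", "hasId23", "hasId47").show(5)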