I have the following DataFrame:
[info] root
[info] |-- id: integer (nullable = true)
[info] |-- vec: array (nullable = true)
[info] | |-- element: double (containsNull = false)
[info] +----+--------------------+
[info] | id| vec|
[info] +----+--------------------+
[info] | 59|[-0.17827, 0.4178...|
[info] | 59|[-0.17827, 0.4178...|
[info] | 79|[-0.17827, 0.4178...|
[info] | 280|[-0.17827, 0.4178...|
[info] | 385|[-0.17827, 0.4178...|
[info] | 419|[-0.17827, 0.4178...|
[info] | 514|[-0.17827, 0.4178...|
[info] | 757|[-0.17827, 0.4178...|
[info] | 787|[-0.17827, 0.4178...|
[info] |1157|[-0.17827, 0.4178...|
[info] |1157|[-0.17827, 0.4178...|
[info] |1400|[-0.17827, 0.4178...|
[info] |1632|[-0.17827, 0.4178...|
[info] |1637|[-0.17827, 0.4178...|
[info] |1639|[-0.17827, 0.4178...|
[info] |1747|[-0.17827, 0.4178...|
[info] |1869|[-0.17827, 0.4178...|
[info] |1929|[-0.17827, 0.4178...|
[info] |1929|[-0.17827, 0.4178...|
[info] |2059|[-0.17827, 0.4178...|
[info] +----+--------------------+
I am trying to group it by id and then find the mean vector for each id (each array represents one vector). Since I don't even know how to do the sum yet, computing the average already feels a bit advanced! I wrote the following code:
val aggregated = joined_df
  .rdd
  .map { case Row(k: Int, v: Array[Double]) => (k, v) }
  .reduceByKey((acc, element) => (acc, element).zipped.map(_ + _))
  .toDF("id", "vec")
Whenever I run it, I get a stack trace that begins with:

scala.MatchError: [59,WrappedArray(-0

This is my first time using Spark, and I have hit the limit of what googling can tell me. I am out of ideas; from everything I have found, this should work.
Answer 0 (score: 1)
I think this may be the problem: the number of elements in the collection has to line up with the pattern in the match clause.
val vec1 = Vector(1, 2, 3)

vec1 match {
  case Vector(a, b, c) => println("vector matched")
}

vec1 match {
  case Vector(a, b) => println("vector matched")
}
In the example above, the first match succeeds, but the second fails at runtime with a scala.MatchError.
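The same rule applies to type patterns, which only match the value's runtime class. Spark returns array columns as a WrappedArray (a Seq[Double]) rather than a JVM Array[Double], so a pattern like v: Array[Double] never matches. A small standalone sketch of that mismatch:

// A Seq is not an Array at runtime, so the first case never fires.
val v: Any = Seq(-0.17827, 0.4178)
v match {
  case arr: Array[Double] => println("matched Array")
  case seq: Seq[_]        => println("matched Seq") // this branch runs
}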
The

scala.MatchError: [59,WrappedArray(-0

line in your stack trace is probably the hint.
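For completeness, here is a minimal sketch of a fix along those lines, assuming the question's joined_df and a SparkSession named spark in scope for the toDF implicits: match the column as Seq[Double], sum the vectors element-wise per id while counting rows, then divide to get the mean.

import org.apache.spark.sql.Row
import spark.implicits._ // for .toDF on an RDD

val averaged = joined_df
  .rdd
  // Spark hands the array column back as a Seq[Double] (a WrappedArray).
  .map { case Row(k: Int, v: Seq[Double]) => (k, (v, 1)) }
  // Element-wise sum of the vectors, plus a row count per id.
  .reduceByKey { case ((v1, n1), (v2, n2)) =>
    ((v1, v2).zipped.map(_ + _), n1 + n2)
  }
  // Divide each summed component by the count to get the mean vector.
  .mapValues { case (sum, n) => sum.map(_ / n) }
  .toDF("id", "vec")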