Spark.ml DataFrame containing SparseVector

Time: 2017-02-15 22:17:03

Tags: apache-spark apache-spark-mllib

I have a spark.ml DataFrame with many columns, each holding one SparseVector per row. I want to apply colStats to each column. It is defined on Statistics and returns a MultivariateStatisticalSummary; its signature is:

 def colStats(X: RDD[Vector]): MultivariateStatisticalSummary

This looks perfect, except that selecting a column from the DataFrame does not give me an RDD[Vector]. Here is my first attempt:

 val df: DataFrame = data.select(shardId)
 val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
 val s: MultivariateStatisticalSummary = Statistics.colStats(col)

It does not compile; the Scala compiler reports:

 Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._  Support for serializing other types will be added in future releases.
 val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd

I also tried:

 val df = data.select(shardId)
 val col: RDD[Vector] = df.map(x => x.asInstanceOf[org.apache.spark.mllib.linalg.Vector])
 val s: MultivariateStatisticalSummary = Statistics.colStats(col)

This compiles but fails at runtime with:

 java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to org.apache.spark.mllib.linalg.Vector

How do I bridge the gap between DataFrame and colStats?

1 answer:

Answer 0: (score: 0)

I found the answer after all:

 val df = data.select(shardId)
 val col: RDD[Vector] = df.map { _.get(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] }
 val s: MultivariateStatisticalSummary = Statistics.colStats(col)

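The trick is simply to extract the first element of each row before casting. Mapping over a DataFrame yields Row objects, not the values inside them, which is why casting the Row itself to Vector threw the ClassCastException.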
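
For readers on Spark 2.x or later, the same idea can be sketched as follows. This is an illustration under assumptions, not part of the original answer: on those versions DataFrame.map needs an encoder, so the sketch goes through .rdd first, and it assumes the columns hold org.apache.spark.ml.linalg vectors, which are converted with Vectors.fromML before calling colStats. The helper name columnSummary, the column names shard_0 and shard_1, and the sample data are all hypothetical.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => MLlibVector, Vectors => MLlibVectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object ColStatsSketch {

  // Hypothetical helper: summarize one vector-valued column of a DataFrame.
  // Assumes the column stores spark.ml vectors (Spark 2.x and later).
  def columnSummary(data: DataFrame, column: String): MultivariateStatisticalSummary = {
    val col: RDD[MLlibVector] = data
      .select(column)
      .rdd // RDD[Row]: each Row wraps exactly one vector
      // Unwrap element 0 of the Row, then convert ml -> mllib for colStats.
      .map(row => MLlibVectors.fromML(row.getAs[MLVector](0)))
    Statistics.colStats(col)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("colStats-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: two columns, one sparse vector per row in each.
    val data = Seq(
      (MLVectors.sparse(3, Array(0, 2), Array(1.0, 3.0)),
       MLVectors.sparse(2, Array(0), Array(4.0))),
      (MLVectors.sparse(3, Array(1), Array(2.0)),
       MLVectors.sparse(2, Array(1), Array(6.0)))
    ).toDF("shard_0", "shard_1")

    // Apply colStats to every column, as the question asks.
    data.columns.foreach { name =>
      val s = columnSummary(data, name)
      println(s"$name: count=${s.count}, mean=${s.mean}")
    }

    spark.stop()
  }
}
```

If the columns already hold org.apache.spark.mllib.linalg vectors, as in the question, the fromML conversion can presumably be dropped and getAs called with the mllib Vector type directly.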