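I have a spark.ml DataFrame with many columns, each of which holds one SparseVector per row. I want to apply Statistics.colStats, which returns a MultivariateStatisticalSummary, to each column. The signature of colStats is:
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary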
This looks perfect... except that selecting a column from the DataFrame does not give me an RDD[Vector]. Here is my attempt:
val df: DataFrame = data.select(shardId)
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
This fails to compile with the message (in Scala):
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
I also tried:
val df = data.select(shardId)
val col: RDD[Vector] = df.map(x => x.asInstanceOf[org.apache.spark.mllib.linalg.Vector])
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which fails at runtime with the error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to org.apache.spark.mllib.linalg.Vector
How do I bridge the gap between DataFrame and colStats?
Answer 0 (score: 0)
I found the answer after all:
val df = data.select(shardId)
val col: RDD[Vector] = df.map { _.get(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] }
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
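The trick is simply to extract the first element of each row before casting. df.map iterates over Row objects, not over the vectors they contain, which is why casting the row itself raised the ClassCastException.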
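For anyone who wants to run this over every column, here is a minimal sketch of the same idea. It assumes the columns hold org.apache.spark.mllib.linalg vectors, as in the question. The names ColStatsPerColumn, vectorColumn, shardA and shardB are made up for illustration, and the tiny two-row DataFrame stands in for the real data. It goes through .rdd explicitly and uses Row.getAs instead of get(0).asInstanceOf, which should behave the same way here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

object ColStatsPerColumn {
  // Hypothetical helper: select one vector-valued column and unwrap each Row,
  // taking field 0 of the single-column row as a Vector.
  def vectorColumn(data: DataFrame, name: String): RDD[Vector] =
    data.select(name).rdd.map { row => row.getAs[Vector](0) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("colStats-sketch")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative data only: two columns, one SparseVector per row in each.
    val data: DataFrame = sc.parallelize(Seq(
      (Vectors.sparse(3, Seq((0, 1.0))), Vectors.sparse(2, Seq((1, 4.0)))),
      (Vectors.sparse(3, Seq((2, 3.0))), Vectors.sparse(2, Seq((0, 2.0))))
    )).toDF("shardA", "shardB")

    // Compute one summary per column.
    data.columns.foreach { name =>
      val s: MultivariateStatisticalSummary =
        Statistics.colStats(vectorColumn(data, name))
      println(s"$name: count=${s.count} mean=${s.mean}")
    }

    sc.stop()
  }
}
```

Each call to vectorColumn triggers a separate pass over the data, so with many columns it may be worth caching data first.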