I am reading a stream of data from a folder stored on HDFS. I have the following small snippet of code:
// Convert text into a DataSet of LogEntry rows. Select the two columns we care about
val df = rawData.flatMap(parseLog).select("ip", "status")
df.isStreaming // check that this is a streaming DataFrame
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(df)
// Evaluate clustering by computing Within Set Sum of Squared Errors.
val WSSSE = model.computeCost(df)
println(s"Within Set Sum of Squared Errors = $WSSSE")
// Shows the K-means result
println("Cluster Centers: ")
model.clusterCenters.foreach(println)
When I run the above, I get the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:93)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:254)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:340)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:305)
at StructuredStreaming$.main(<console>:189)
... 90 elided
I'm completely stuck on this. Any help would be greatly appreciated.
UPDATE
Following EmiCareOfCell44's answer, I made the following changes:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("ip", "status")).setOutputCol("features")
val output = assembler.transform(df).select("features")
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(output)
The code now compiles, but when I run it I get the following error:
java.lang.IllegalArgumentException: Data type StringType is not supported.
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
at StructuredStreaming$.main(<console>:129)
... 60 elided
I think it's getting closer; it just needs a tweak.
Answer 0 (score: 1)
You must first use a VectorAssembler to build the feature vector. Something like:
val assembler = new VectorAssembler().setInputCols(Array("ip", "status")).setOutputCol("features")
val df2 = assembler.transform(df).select("features")
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(df2)
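Regarding the `StringType is not supported` error from the update: `VectorAssembler` only accepts numeric, boolean, and vector input columns, so the two string columns have to be converted first. A minimal sketch, assuming `status` holds an HTTP status code as a string (so a plain cast works) and `ip` is categorical (so it needs a `StringIndexer`) — the column names `statusNum` and `ipIndex` are made up here for illustration:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// "status" is numeric text, so casting it to double is enough.
val withNumeric = df.withColumn("statusNum", df("status").cast("double"))

// "ip" is categorical; map each distinct address to a numeric index.
val ipIndexer = new StringIndexer()
  .setInputCol("ip")
  .setOutputCol("ipIndex")
val indexed = ipIndexer.fit(withNumeric).transform(withNumeric)

// Now both inputs are numeric, so VectorAssembler can build "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("ipIndex", "statusNum"))
  .setOutputCol("features")
val output = assembler.transform(indexed).select("features")

val model = new KMeans().setK(2).setSeed(1L).fit(output)
```

One caveat: `StringIndexer` assigns indices by label frequency, so distances between indexed IPs are arbitrary and k-means will treat them as if they were meaningful; for real clustering you would likely want a more deliberate encoding of the IP column. Also note that `KMeans.fit` is a batch operation, so it cannot be called directly on a streaming DataFrame (`df.isStreaming == true`) — this sketch assumes the data has been read as a static DataFrame or collected from the stream first.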