Spark DataFrame does not respect the schema and treats everything as String

Time: 2016-03-14 14:23:48

Tags: scala apache-spark apache-spark-sql apache-spark-mllib scala-collections

I am facing a problem that I have not been able to solve for quite a while now.

  1. I am on Spark 1.4 and Scala 2.10. I cannot upgrade at this point (large distributed infrastructure).

  2. I have a file with a few hundred columns, of which only two are strings and the rest are all Long. I want to convert this data into a label/features DataFrame.

  3. I have been able to get it into LibSVM format.

  4. I cannot get it into the label/features format.

  5. The reason is:

    1. I cannot use toDF() as shown at https://spark.apache.org/docs/1.5.1/ml-ensembles.html:

      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
      

    2. It (toDF()) is not supported in Spark 1.4.

    3. So I first converted the txt file into a DataFrame; I used something like this:

      import org.apache.spark.SparkContext
      import org.apache.spark.sql.{DataFrame, Row, SQLContext}
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      // Only the two string columns get StringType; everything else is declared as Long
      def getSFColumnDType(columnName: String): StructField = {
        if ((columnName == "strcol1") || (columnName == "strcol2"))
          StructField(columnName, StringType, false)
        else
          StructField(columnName, LongType, false)
      }

      def getDataFrameFromTxtFile(sc: SparkContext, staticfeatures_filepath: String, schemaConf: String): DataFrame = {
        val sfRDD = sc.textFile(staticfeatures_filepath)
        val sqlContext = new SQLContext(sc)

        // Reads a space-delimited string from the application.properties file
        val schemaString = readConf(Array(schemaConf)).get(schemaConf).getOrElse("")

        // Generate the schema based on the schema string
        val schema = StructType(
          schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))

        val data = sfRDD
          .map(line => line.split(","))
          .map(p => Row.fromSeq(p.toSeq))

        var df = sqlContext.createDataFrame(data, schema)

        // schemaString.split(" ").drop(4)
        //   .map(s => df = convertColumn(df, s, "int"))

        df
      }
      
    4. When I run df.na.drop() and df.printSchema() on the returned DataFrame, I get a perfect schema:

      root
       |-- rand_entry: long (nullable = false)
       |-- strcol1: string (nullable = false)
       |-- label: long (nullable = false)
       |-- strcol2: string (nullable = false)
       |-- f1: long (nullable = false)
       |-- f2: long (nullable = false)
       |-- f3: long (nullable = false)
      and so on till around f300
      

      But - the sad part is that anything I try to do with df (see below) always gives me a ClassCastException (java.lang.String cannot be cast to java.lang.Long):

      val featureColumns = Array("f1","f2",....."f300")
      assertEquals(-99,df.select("f1").head().getLong(0))
      assertEquals(-99,df.first().get(4))
      val transformeddf = new VectorAssembler()
              .setInputCols(featureColumns)
              .setOutputCol("features")
              .transform(df)
      

      So - the bad part is that even though the schema shows Long, the df still treats everything internally as a String.

      Edit

      Adding a simpler example:

      Say I have a file like this:

      1,A,20,P,-99,1,0,0,8,1,1,1,1,131153
      1,B,23,P,-99,0,1,0,7,1,1,0,1,65543
      1,C,24,P,-99,0,1,0,9,1,1,1,1,262149
      1,D,7,P,-99,0,0,0,8,1,1,1,1,458759
      

      sf-schema=f0 strCol1 f1 strCol2 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11
      

      (The column names don't really matter, so you can ignore this detail.)

      All I am trying to do is create a label/features type of DataFrame where my 3rd column becomes the label and the 5th to 11th columns become the features [Vector] column, so that I can follow the rest of the steps in https://spark.apache.org/docs/latest/ml-classification-regression.html#tree-ensembles.

      I have already cast the columns as suggested by zero323:

      val types = Map("strCol1" -> "string", "strCol2" -> "string")
              .withDefault(_ => "bigint")
      df = df.select(df.columns.map(c => df.col(c).cast(types(c)).alias(c)): _*)
      df = df.drop("f0")
      df = df.drop("strCol1")
      df = df.drop("strCol2")
      

      But the assertions and the VectorAssembler still fail. featureColumns = Array("f2", "f3", ..... "f11"). This is the entire sequence I run once I have my df:

          import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}

          var transformeddf = new VectorAssembler()
          .setInputCols(featureColumns)
          .setOutputCol("features")
          .transform(df)
      
          transformeddf.show(2)
      
          transformeddf = new StringIndexer()
          .setInputCol("f1")
          .setOutputCol("indexedF1")
          .fit(transformeddf)
          .transform(transformeddf)
      
          transformeddf.show(2)
      
          transformeddf = new VectorIndexer()
          .setInputCol("features")
          .setOutputCol("indexedFeatures")
          .setMaxCategories(5)
          .fit(transformeddf)
          .transform(transformeddf)
      

      The exception trace from ScalaIDE, right when it hits the VectorAssembler, is as follows:

      java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
          at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
          at scala.math.Numeric$LongIsIntegral$.toDouble(Numeric.scala:117)
          at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
          at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
          at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:436)
          at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
          at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
          at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
          at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
          at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
          at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
          at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
          at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
          at scala.collection.AbstractTraversable.map(Traversable.scala:105)
          at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:75)
          at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:56)
          at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:72)
          at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
          at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)
          at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
          at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
          at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
          at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
          at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
          at scala.collection.Iterator$class.foreach(Iterator.scala:727)
          at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
          at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
          at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
          at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
          at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
          at scala.collection.AbstractIterator.to(Iterator.scala:1157)
          at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
          at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
          at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
          at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
          at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
          at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
          at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
          at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
          at org.apache.spark.scheduler.Task.run(Task.scala:70)
          at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          at java.lang.Thread.run(Thread.java:745)
      

1 Answer:

Answer 0 (score: 8):

You get the ClassCastException because this is exactly what is supposed to happen. The schema argument is not used for automatic casting (some DataSources may use a schema that way, but not methods like createDataFrame); it only declares the types of the values that are stored in the rows. It is your responsibility to pass data that matches the schema, not the other way around.

While the DataFrame shows the schema you've declared, that schema is validated only at runtime, hence the runtime exception. If you want to transform the data to specific types, you have to cast it explicitly.
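
A minimal sketch of that behavior (assuming the sc and sqlContext from the question are in scope; the single column f1 is hypothetical): printSchema takes the declared schema at face value, but the Strings stored in the rows only blow up when a value is actually evaluated.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // The rows physically hold Strings, while the schema claims LongType
    val rows = sc.parallelize(Seq(Row("1"), Row("2")))
    val schema = StructType(Seq(StructField("f1", LongType, false)))
    val df = sqlContext.createDataFrame(rows, schema)

    df.printSchema()                    // shows f1: long (nullable = false)
    df.select("f1").first().getLong(0)  // ClassCastException: String cannot be cast to Long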

  1. First read all the columns as StringType:

    val rows = sc.textFile(staticfeatures_filepath)
      .map(line => Row.fromSeq(line.split(",")))
    
    val schema = StructType(
      schemaString.split(" ").map(
        columnName => StructField(columnName, StringType, false)
      )
    )
    
    val df = sqlContext.createDataFrame(rows, schema)
    
  2. Next cast the selected columns to the desired types:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{LongType, StringType}
    
    val types = Map("strcol1" -> StringType, "strcol2" -> StringType)
      .withDefault(_ => LongType)
    
    val casted = df.select(df.columns.map(c => col(c).cast(types(c)).alias(c)): _*)
    
  3. Use the assembler:

    val transformeddf = new VectorAssembler()
      .setInputCols(featureColumns)
      .setOutputCol("features")
      .transform(casted)
    
  4. You can simply perform steps 1 and 2 using spark-csv (a combined sketch of the whole pipeline follows after this list):

    // As originally 
    val schema = StructType(
      schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))
    
    
    val df = sqlContext
      .read.schema(schema)
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .load(staticfeatures_filepath)
    

    See also Correctly reading the types from file in PySpark.
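
Putting the pieces together, here is a hedged end-to-end sketch for the simple example from the question (3rd column as the label, 5th to 11th columns as features). The file path, the app name, and the hard-coded schema string are assumptions for illustration; spark-csv parses each field according to the declared type, so no manual casting is needed before the VectorAssembler.

    import org.apache.spark.SparkContext
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

    val sc = new SparkContext("local[*]", "label-features-sketch")  // or reuse the existing context
    val sqlContext = new SQLContext(sc)

    // sf-schema from the question: strCol1/strCol2 are Strings, everything else is Long
    val schemaString = "f0 strCol1 f1 strCol2 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11"
    val schema = StructType(schemaString.split(" ").map { name =>
      if (name == "strCol1" || name == "strCol2") StructField(name, StringType, false)
      else StructField(name, LongType, false)
    })

    val staticfeatures_filepath = "/path/to/staticfeatures.csv"  // hypothetical path
    val df = sqlContext.read
      .schema(schema)
      .format("com.databricks.spark.csv")
      .option("header", "false")
      .load(staticfeatures_filepath)

    // 3rd column (f1) becomes the label; 5th to 11th columns (f2..f8) go into the feature vector
    val featureColumns = Array("f2", "f3", "f4", "f5", "f6", "f7", "f8")
    val assembled = new VectorAssembler()
      .setInputCols(featureColumns)
      .setOutputCol("features")
      .transform(df)

    assembled.select("f1", "features").show(2)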