Question

Since VectorAssembler crashes if a passed column has any type other than NumericType or BooleanType, and I am dealing with a lot of TimestampType columns, I would like to know:

Is there an easy way to cast multiple columns at once?

Based on this answer I already have a convenient way to cast a single column:

def castColumnTo(df: DataFrame, 
    columnName: String, 
    targetType: DataType ) : DataFrame = {
      df.withColumn( columnName, df(columnName).cast(targetType) )
}

I thought about calling castColumnTo recursively, but I strongly doubt that this is the (performant) way to go.

Answer 1

Casting all columns of a given type with an idiomatic approach in Scala:

def castAllTypedColumnsTo(df: DataFrame, sourceType: DataType, targetType: DataType) = {
  df.schema.filter(_.dataType == sourceType).foldLeft(df) {
    case (acc, col) => acc.withColumn(col.name, df(col.name).cast(targetType))
  }
}
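
As far as I know, each withColumn call adds another projection to the query plan, so with many columns a single select may be worth trying. The following is only a sketch of that variant, not the answer's code: the helper name castAllTypedColumnsInSelect, the sample DataFrame and its column names are made up for illustration, and it assumes plain column names without dots or backticks.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}
import java.sql.Timestamp

object CastInSelectDemo {
  // Hypothetical helper (not from the answer): casts every column of sourceType
  // in one select and leaves all other columns untouched.
  // Assumes plain column names without dots or backticks.
  def castAllTypedColumnsInSelect(df: DataFrame, sourceType: DataType, targetType: DataType): DataFrame = {
    val projected = df.schema.fields.map { field =>
      if (field.dataType == sourceType) col(field.name).cast(targetType).as(field.name)
      else col(field.name)
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cast-demo").getOrCreate()
    import spark.implicits._

    // Made-up sample data with two TimestampType columns.
    val events = Seq(
      (1, Timestamp.valueOf("2020-01-01 00:00:00"), Timestamp.valueOf("2020-01-02 00:00:00"))
    ).toDF("id", "created", "updated")

    // The two timestamp columns should now show up as long in the schema.
    castAllTypedColumnsInSelect(events, TimestampType, LongType).printSchema()
    spark.stop()
  }
}
```

Column order and names stay the same as in the input, so the result can be used as a drop-in replacement for the foldLeft version.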

Answer 2

Based on the comments (thanks!) I came up with the following code (no error handling implemented):

def castAllTypedColumnsTo(df: DataFrame, 
   sourceType: DataType, targetType: DataType) : DataFrame = {

      val columnsToBeCasted = df.schema
         .filter(s => s.dataType == sourceType)

      //if(columnsToBeCasted.length > 0) {
      //   println(s"Found ${columnsToBeCasted.length} columns " +
      //      s"(${columnsToBeCasted.map(s => s.name).mkString(",")})" +
      //      s" - casting to ${targetType.typeName.capitalize}Type")
      //}

      // Cast each matching column to targetType, threading the DataFrame through the fold.
      columnsToBeCasted.foldLeft(df){(foldedDf, col) => 
         castColumnTo(foldedDf, col.name, targetType)}
}

Thanks for the inspiring comments. foldLeft (explained here and here) saves a for loop that iterates over a var DataFrame.
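
To make the foldLeft point concrete without needing Spark, here is a small sketch that uses a plain Map as a stand-in for a DataFrame schema. The names Schema, castColumn, castWithLoop and castWithFold are made up for illustration. It shows a var-plus-for-loop version and the equivalent foldLeft version, where the accumulator takes the place of the var.

```scala
object FoldLeftDemo {
  // Hypothetical stand-in for a DataFrame: a map from column name to type name.
  type Schema = Map[String, String]

  // Stand-in for castColumnTo: returns a new schema with one column's type replaced.
  def castColumn(schema: Schema, name: String, target: String): Schema =
    schema.updated(name, target)

  // Version with a mutable variable and a for loop.
  def castWithLoop(schema: Schema, names: Seq[String], target: String): Schema = {
    var current = schema
    for (name <- names) current = castColumn(current, name, target)
    current
  }

  // Same result with foldLeft: the accumulator replaces the var.
  def castWithFold(schema: Schema, names: Seq[String], target: String): Schema =
    names.foldLeft(schema)((acc, name) => castColumn(acc, name, target))

  def main(args: Array[String]): Unit = {
    val schema = Map("id" -> "int", "created" -> "timestamp", "updated" -> "timestamp")
    val toCast = Seq("created", "updated")
    // Both versions produce the same schema.
    println(castWithLoop(schema, toCast, "long") == castWithFold(schema, toCast, "long")) // prints true
  }
}
```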

Answer 3

from pyspark.sql.types import StringType, FloatType, IntegerType

FastDf = spark.read.csv("Something.csv", header=False, mode="DROPMALFORMED")
FastDf.OldTypes = [field.dataType for field in FastDf.schema.fields]
FastDf.NewTypes = [StringType(), FloatType(), FloatType(), IntegerType()]
FastDf.OldColnames = FastDf.columns
FastDf.NewColnames = ['S_tring', 'F_loat', 'F_loat2', 'I_nteger']
# Cast each column (selected by position) to its new type and give it its new name.
FastDfSchema = FastDf.select(*(
    FastDf[colnumber]
    .cast(FastDf.NewTypes[colnumber])
    .alias(FastDf.NewColnames[colnumber])
    for colnumber in range(len(FastDf.NewTypes))
))

I know it's in PySpark, but the logic might come in handy.

Answer 4

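I am translating a Scala program to Python, and I found a smart answer to your question. The columns are named V1 - V28, Time, Amount, Class. (I'm not a Scala pro.) The solution looks like this.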

import org.apache.spark.sql.functions.col

// Cast all the columns to Double type.
val df = raw.select(((1 to 28).map(i => "V" + i) ++ Array("Time", "Amount", "Class")).map(s => col(s).cast("Double")): _*)

Link: https://github.com/intel-analytics/analytics-zoo/blob/master/apps/fraudDetection/Fraud%20Detction.ipynb

Scala & Spark: Cast multiple columns at once

4 answers: