Converting string data in a DataFrame to double

Asked: 2017-01-02 15:43:28

Tags: scala apache-spark apache-spark-sql

I have a CSV file containing doubles. When I load it into a DataFrame, I get an error telling me that java.lang.String cannot be cast to java.lang.Double, even though my data is numeric. How can I get a DataFrame with double-typed columns from this CSV file? How should I modify my code?

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, DoubleType}
import org.apache.spark.sql.functions.split
import scala.collection.mutable._

object Example extends App {

  val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
  // Every column read by spark.read.csv is typed as string by default
  val data = spark.read.csv("C://lpsa.data").toDF("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9")
  val data2 = data.select("col2", "col3", "col4", "col5", "col6", "col7")
}

How can I convert every row of the DataFrame to the double type? Thanks.

2 Answers:

Answer 0 (score: 6)

Use select with cast:
import org.apache.spark.sql.functions.col

data.select(Seq("col2", "col3", "col4", "col5", "col6", "col7").map(
  c => col(c).cast("double")
): _*)
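If you would rather keep all nine columns and only retype the numeric ones, an equivalent per-column variant (a sketch, not from the original answer) folds withColumn with the same cast over the column names:

import org.apache.spark.sql.functions.col

// Replace each listed column with its double-cast version; the other columns pass through.
// Values that cannot be parsed as numbers become null instead of throwing an error.
val typed = Seq("col2", "col3", "col4", "col5", "col6", "col7")
  .foldLeft(data)((df, c) => df.withColumn(c, col(c).cast("double")))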

Or pass a schema to the reader:

  • Define the schema:

    import org.apache.spark.sql.types._
    
    val cols = Seq(
      "col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9"
    )
    
    val doubleCols = Set("col2", "col3", "col4", "col5", "col6", "col7")
    
    val schema = StructType(cols.map(
      c => StructField(c, if (doubleCols contains c) DoubleType else StringType)
    ))
    
  • and pass it as the argument to the schema method (a combined sketch follows this list):

    spark.read.schema(schema).csv(path)
    
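Putting the two steps together against the file from the question (a sketch reusing the path and the spark session defined there):

val typed = spark.read.schema(schema).csv("C://lpsa.data")

// Column names come from the schema, so no toDF call is needed;
// col2..col7 are double and the remaining columns stay string.
typed.printSchema()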

Schema inference can also be used:

spark.read.option("inferSchema", "true").csv(path)

but it is considerably more expensive, since inferring the schema requires an additional pass over the data.

Answer 1 (score: 1)

I believe using Spark's inferSchema option while reading the CSV file will come in handy here. Below is code that automatically detects the double-typed columns:

val data = spark.read
                .format("csv")
                .option("header", "false")
                .option("inferSchema", "true")
                .load("C://lpsa.data").toDF()


Note: I am using Spark version 2.2.0
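To confirm what inference actually produced, you can print the resulting schema (a quick check using the data value defined above):

// Columns that parse as numbers should be reported as double
// (columns containing only whole numbers may be inferred as integer instead)
data.printSchema()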