Spark DataFrame data type is String

Time: 2018-07-16 14:06:39

Tags: java apache-spark dataframe apache-spark-sql

I am trying to verify the data types of a DataFrame by running `describe` as a SQL query, but every time I get the datetime column back as a string.

1. First, I tried with the following code:

    SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
    Dataset<Row> df = sparkSession.read()
            .option("header", "true")
            .option("inferSchema", "true")
            .format("csv")
            .load("/user/data/*_ecs.csv");

    try {
        df.createTempView("data");
        Dataset<Row> sqlDf = sparkSession.sql("Describe data");
        sqlDf.show(300, false);
    } catch (AnalysisException e) {
        e.printStackTrace();
    }

    Output:
    +-----------------+---------+-------+
    |col_name         |data_type|comment|
    +-----------------+---------+-------+
    |id               |int      |null   |
    |symbol           |string   |null   |
    |datetime         |string   |null   |
    |side             |string   |null   |
    |orderQty         |int      |null   |
    |price            |double   |null   | 
    +-----------------+---------+-------+
2. I also tried a custom schema, but in that case I get an exception whenever I run any query other than describing the table:

    SparkSession sparkSession = new SparkSession.Builder().getOrCreate();
    Dataset<Row> df = sparkSession.read()
            .option("header", "true")
            .schema(customSchema)
            .format("csv")
            .load("/use/data/*_ecs.csv");

    try {
        df.createTempView("trade_data");
        Dataset<Row> sqlDf = sparkSession.sql("Describe trade_data");
        sqlDf.show(300, false);
    } catch (AnalysisException e) {
        e.printStackTrace();
    }
    
    Output:
    +--------+---------+-------+
    |col_name|data_type|comment|
    +--------+---------+-------+
    |datetime|timestamp|null   |
    |price   |double   |null   |
    |orderQty|double   |null   |
    +--------+---------+-------+
    

However, if I try any other query, I get the following exception:

Dataset<Row> sqlDf=sparkSession.sql("select DATE(datetime),avg(price),avg(orderQty) from data group by datetime");


java.lang.IllegalArgumentException
        at java.sql.Date.valueOf(Date.java:143)
        at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:137)
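
This exception is easy to reproduce outside Spark: per the stack trace, Spark's `stringToTime` ends up calling `java.sql.Date.valueOf`, which only accepts strings in strict `yyyy-MM-dd` form and throws `IllegalArgumentException` for anything else. A minimal sketch with plain JDK code (the non-matching sample value is hypothetical):

```java
import java.sql.Date;

public class DateValueOfDemo {
    public static void main(String[] args) {
        // Strict yyyy-MM-dd parses fine:
        System.out.println(Date.valueOf("2018-07-16"));

        // Any other layout (hypothetical sample) is rejected:
        try {
            Date.valueOf("16/07/2018");
        } catch (IllegalArgumentException e) {
            System.out.println("IllegalArgumentException, as in the Spark query");
        }
    }
}
```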

How can I fix this?

1 Answer:

Answer 0 (score: 0):

  1. Why doesn't `inferSchema` work? When inferring a CSV schema, Spark only promotes a column to timestamp if its values match the reader's `timestampFormat` option; any other datetime layout falls back to string, which is most likely what happens here.

  2. If you don't want to supply your own schema, one way is:

    Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").option("inferSchema", "true").load("example.csv");
    
    df.printSchema();  // check output - 1
    df.createOrReplaceTempView("df");
    Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
    df1.printSchema();  // check output - 2
    
    ====================================
    
    output - 1:
    root
     |-- id: integer (nullable = true)
     |-- symbol: string (nullable = true)
     |-- datetime: string (nullable = true)
     |-- side: string (nullable = true)
     |-- orderQty: integer (nullable = true)
     |-- price: double (nullable = true)
    
    output - 2:
    root
     |-- id: integer (nullable = true)
     |-- symbol: string (nullable = true)
     |-- side: string (nullable = true)
     |-- orderQty: integer (nullable = true)
     |-- price: double (nullable = true)
     |-- datetime_d: date (nullable = true)
    

    I would pick this approach if there aren't many fields to cast.

  3. If you want to supply your own schema:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    List<StructField> fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("datetime", DataTypes.TimestampType, true));
    fields.add(DataTypes.createStructField("price", DataTypes.DoubleType, true));
    fields.add(DataTypes.createStructField("orderQty", DataTypes.DoubleType, true));
    StructType schema = DataTypes.createStructType(fields);
    Dataset<Row> df = sparkSession.read().format("csv").option("header", "true").schema(schema).load("example.csv");
    
    df.printSchema(); // output - 1
    
    df.createOrReplaceTempView("df");
    Dataset<Row> df1 = sparkSession.sql("select * , Date(datetime) as datetime_d from df").drop("datetime");
    df1.printSchema(); // output - 2
    
    ======================================
    output - 1:
    root
     |-- datetime: timestamp (nullable = true)
     |-- price: double (nullable = true)
     |-- orderQty: double (nullable = true)
    
    output - 2:
    root
     |-- price: double (nullable = true)
     |-- orderQty: double (nullable = true)
     |-- datetime_d: date (nullable = true)
    

    Since this just re-casts the column from timestamp back to date, I don't see much use for this approach, but I'm leaving it here for future reference.
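
One more option, not shown in the answer above: in Spark 2.x the CSV reader accepts a `timestampFormat` option (a `SimpleDateFormat`-style pattern), so if the pattern matches the file's datetime column, both `inferSchema` and an explicit `TimestampType` schema can parse it at read time, with no string column and no re-cast. A quick way to sanity-check a candidate pattern before wiring it into `.option("timestampFormat", ...)` is plain JDK code; the sample value below is hypothetical:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class TimestampPatternCheck {
    public static void main(String[] args) throws ParseException {
        // Hypothetical sample taken from the CSV's datetime column.
        String sample = "2018-07-16 14:06:39";
        // The same pattern string would be passed to the CSV reader as
        // .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setLenient(false); // fail loudly on a mismatch instead of guessing
        System.out.println(fmt.parse(sample)); // parses, so the pattern fits
    }
}
```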