How to handle dates like 00000000 (yyyyMMdd) in a Scala DataFrame?

Asked: 2017-08-24 12:50:38

Tags: scala dataframe spark-dataframe

I am new to Scala. I have a test.csv file that I can read into a DataFrame, but the CSV contains a DATE column with values like "00000000". How can I handle such dates and replace "00000000" with null? Any help would be appreciated.

test.csv

TYPE,CODE,SQ_CODE,RE_TYPE,VERY_ID,IN_DATE,DATE
"F","000544","2017002","OP","95032015062763298","20150610","00000000"
"F","000544","2017002","LD","95032015062763261","20150611","20150519"
"F","000544","2017002","AK","95037854336743246","20150611","20150429"   

val sparkSession = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()

// Create the DataFrame object
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("dateFormat", "yyyyMMdd")
  .load("/xx/xx/xx/xx/test.csv")
  .cache()

df.printSchema
root
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
 |-- SQ_CODE: string (nullable = true)
 |-- RE_TYPE: string (nullable = true)
 |-- VERY_ID: long (nullable = true)
 |-- IN_DATE: date (nullable = true)
 |-- DATE: date (nullable = true)

df.show

Actual output:

+-----+-------+---------+---------+-------------------+---------+-------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE|      DATE   |
+-----+-------+---------+---------+-------------------+---------+-------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610|   0002-11-30|
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611|   2015-05-19|
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611|   2015-04-29|
+-----+-------+---------+---------+-------------------+---------+-------------+

Expected output:

+-----+-------+---------+---------+-------------------+---------+-------------+
| TYPE|   CODE|  SQ_CODE| RE_TYPE |            VERY_ID|  IN_DATE|      DATE   |
+-----+-------+---------+---------+-------------------+---------+-------------+
|   F | 000544|  2017002|      OP |  95032015062763298| 20150610|        null |
|   F | 000544|  2017002|      LD |  95032015062763261| 20150611|   2015-05-19|
|   F | 000544|  2017002|      AK |  95037854336743246| 20150611|   2015-04-29|
+-----+-------+---------+---------+-------------------+---------+-------------+

1 Answer:

Answer 0 (score: 1):

"00000000" is a valid date according to your dateFormat, so Spark cannot treat it as null unless you tell it which string represents null.

In the CSV reader, you can specify it via the nullValue option:

nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
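Applied to the question's reader, a minimal sketch might look like this. Note one assumption worth hedging: nullValue is applied to every column, so this is only safe if no other column can legitimately contain the literal string "00000000".

```scala
// Sketch: map the raw string "00000000" to null at read time,
// before the yyyyMMdd date parser runs.
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", ",")
  .option("dateFormat", "yyyyMMdd")
  .option("nullValue", "00000000") // "00000000" becomes null in ALL columns
  .load("/xx/xx/xx/xx/test.csv")
```

With this option set, the DATE column comes back as null for the first row, which matches the expected output above.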

Alternatively, you can apply a transformation to the DataFrame to convert values back to null when they fall outside a valid range:

import org.apache.spark.sql.functions.when
import sparkSession.implicits._

val df = List((1, "0000"), (2, "12345")).toDF("id", "value")
df.withColumn("value", when(df("value") === "0000", null).otherwise(df("value"))).show

Since your date column has already been parsed into some (nonsensical) value, you can filter it with a simple condition like this:

df.withColumn("date", when(year(df("date")) < 1900, null).otherwise(df("date"))).show
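A slightly more explicit variant of the same idea, as a sketch: using `lit(null).cast(DateType)` keeps the column typed as a date rather than relying on an untyped null, and the 1900 cutoff is an assumption you should adjust to your data.

```scala
import org.apache.spark.sql.functions.{when, year, lit}
import org.apache.spark.sql.types.DateType

// Replace any date parsed from a bogus string (year < 1900) with a typed null.
val cleaned = df.withColumn(
  "DATE",
  when(year(df("DATE")) < 1900, lit(null).cast(DateType)).otherwise(df("DATE"))
)
cleaned.show()
```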