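I'm new to Scala. I have a test.csv file that I can read into a DataFrame, but the CSV has a DATE column containing the value "00000000". How should I handle dates like "00000000"? If a date is "00000000", how can I replace it with null? Any help would be appreciated.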
test.csv
TYPE,CODE,SQ_CODE,RE_TYPE,VERY_ID,IN_DATE,DATE
"F","000544","2017002","OP","95032015062763298","20150610","00000000"
"F","000544","2017002","LD","95032015062763261","20150611","20150519"
"F","000544","2017002","AK","95037854336743246","20150611","20150429"
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().master("local").appName("IT_DATA").getOrCreate()
// Create the DataFrame object
val df = sparkSession.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("location", "/xx/xx/xx/xx/test.csv")
  .option("delimiter", ",")
  .option("dateFormat", "yyyyMMdd")
  .load().cache()
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: date (nullable = true)
|-- DATE: date (nullable = true)
df.show
Actual output
+----+------+-------+-------+-----------------+--------+----------+
|TYPE|  CODE|SQ_CODE|RE_TYPE|          VERY_ID| IN_DATE|      DATE|
+----+------+-------+-------+-----------------+--------+----------+
|   F|000544|2017002|     OP|95032015062763298|20150610|0002-11-30|
|   F|000544|2017002|     LD|95032015062763261|20150611|2015-05-19|
|   F|000544|2017002|     AK|95037854336743246|20150611|2015-04-29|
+----+------+-------+-------+-----------------+--------+----------+
Expected output
+----+------+-------+-------+-----------------+--------+----------+
|TYPE|  CODE|SQ_CODE|RE_TYPE|          VERY_ID| IN_DATE|      DATE|
+----+------+-------+-------+-----------------+--------+----------+
|   F|000544|2017002|     OP|95032015062763298|20150610|      null|
|   F|000544|2017002|     LD|95032015062763261|20150611|2015-05-19|
|   F|000544|2017002|     AK|95037854336743246|20150611|2015-04-29|
+----+------+-------+-------+-----------------+--------+----------+
Answer 0 (score: 1)
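"00000000" matches your dateFormat, so the parser accepts it as a date. Spark therefore cannot treat it as null unless you change the string that represents a null value.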
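To see how a value like "00000000" can turn into 0002-11-30, here is a small sketch using plain `java.text.SimpleDateFormat`, with no Spark involved. It assumes the CSV reader's date parsing behaves like a lenient calendar parser, which would be consistent with the output in the question but is not confirmed here. `ZeroDateDemo`, `parseLenient` and `parseStrict` are illustrative names.

```scala
import java.text.SimpleDateFormat
import scala.util.Try

object ZeroDateDemo {
  private val inputPattern = "yyyyMMdd"
  private val outputPattern = "yyyy-MM-dd"

  // Lenient parsing (the SimpleDateFormat default): month 0 and day 0
  // are rolled backwards into the previous month and year instead of failing.
  def parseLenient(s: String): String = {
    val in = new SimpleDateFormat(inputPattern)
    new SimpleDateFormat(outputPattern).format(in.parse(s))
  }

  // Strict parsing: out-of-range fields raise a ParseException,
  // which is mapped to None here.
  def parseStrict(s: String): Option[String] = {
    val in = new SimpleDateFormat(inputPattern)
    in.setLenient(false)
    Try(in.parse(s)).toOption.map(d => new SimpleDateFormat(outputPattern).format(d))
  }

  def main(args: Array[String]): Unit = {
    println(parseLenient("00000000")) // 0002-11-30
    println(parseStrict("00000000"))  // None
    println(parseStrict("20150519"))  // Some(2015-05-19)
  }
}
```

So the string is not rejected by a lenient parser; it is silently shifted to a date far in the past, which is why it has to be mapped to null explicitly.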
The CSV reader lets you set that string through the nullValue option:
nullValue (default empty string): sets the string representation of a null value. Since 2.0.1, this applies to all supported types including the string type.
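As a rough sketch of how that option could be applied to the file in the question: the example below uses the built-in `csv` reader (Spark 2.x or later) with an explicit schema rather than `inferSchema`, which is my assumption and not what the question does. The object name `NullValueSketch`, the `read` helper and the temp-file handling are illustrative. Keep in mind that nullValue is not limited to the DATE column: any field whose text is exactly "00000000" would become null.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

object NullValueSketch {
  // Sample rows copied from the question's test.csv.
  val sampleCsv: String =
    """TYPE,CODE,SQ_CODE,RE_TYPE,VERY_ID,IN_DATE,DATE
      |"F","000544","2017002","OP","95032015062763298","20150610","00000000"
      |"F","000544","2017002","LD","95032015062763261","20150611","20150519"
      |"F","000544","2017002","AK","95037854336743246","20150611","20150429"
      |""".stripMargin

  // Explicit schema (an assumption): both date columns are declared as DateType.
  val schema: StructType = StructType(Seq(
    StructField("TYPE", StringType),
    StructField("CODE", StringType),
    StructField("SQ_CODE", StringType),
    StructField("RE_TYPE", StringType),
    StructField("VERY_ID", StringType),
    StructField("IN_DATE", DateType),
    StructField("DATE", DateType)))

  // Reads the CSV, treating the literal text "00000000" as null.
  def read(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("dateFormat", "yyyyMMdd")
      .option("nullValue", "00000000")
      .schema(schema)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nullValueSketch").getOrCreate()
    val path = Files.createTempFile("test", ".csv")
    Files.write(path, sampleCsv.getBytes("UTF-8"))
    read(spark, path.toString).show()
    spark.stop()
  }
}
```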
Alternatively, you can apply a transformation to the DataFrame that converts values outside the valid range back to null:
import org.apache.spark.sql.functions.when
import sparkSession.implicits._ // needed for toDF
val df = List((1, "0000"), (2, "12345")).toDF("id", "value")
df.withColumn("value", when(df("value") === "0000", null).otherwise(df("value"))).show
Since your dates have already been parsed into some value, you can filter them with a simple expression like this:
import org.apache.spark.sql.functions.{when, year}
df.withColumn("DATE", when(year(df("DATE")) < 1900, null).otherwise(df("DATE"))).show
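A different route, sketched here as an alternative rather than as part of the approach above, is to keep DATE as a string while reading, null out the "00000000" placeholder, and only then convert it to a date. This assumes Spark 2.2 or later, where `to_date` accepts a format string. `ZeroDateToNull` and `zeroDateToNull` are hypothetical names, and the two-column DataFrame is a cut-down stand-in for the question's data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, to_date, when}

object ZeroDateToNull {
  // Replaces the "00000000" placeholder with null, then parses the
  // remaining yyyyMMdd strings into a DateType column.
  def zeroDateToNull(df: DataFrame, column: String): DataFrame = {
    // `when` without `otherwise` yields null for rows that do not match.
    val cleaned = when(col(column) =!= "00000000", col(column))
    df.withColumn(column, to_date(cleaned, "yyyyMMdd"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("zeroDateToNull").getOrCreate()
    import spark.implicits._

    // DATE is still a plain string at this point.
    val raw = Seq(
      ("OP", "00000000"),
      ("LD", "20150519"),
      ("AK", "20150429")
    ).toDF("RE_TYPE", "DATE")

    zeroDateToNull(raw, "DATE").show()
    spark.stop()
  }
}
```

Working on the raw string avoids depending on whatever date the parser produced for the placeholder, so no year threshold such as 1900 is needed.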