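I have a DataFrame in which several columns hold date data. I want to validate those columns and, if a date is invalid, update the DataFrame with an error-message column. I have tried but could not get it to work. Here is the code I tried: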
val DATE_TIME_FORMAT = "MM-dd-yy"
def validateDf(row: Row): Boolean = try {
// assume row.getString(2) will give the date-time string
java.time.LocalDateTime.parse(row.getString(2), java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
true
} catch {
case ex: java.time.format.DateTimeParseException => {
// Handle exception if you want
false
}
}
val validDf = sample1.filter(validateDf(_))
val inValidDf = sample1.except(validDf)
Expected DataFrame:
+-------+-----+-----------+-------------+-------------+
|AirName|Place|TakeoffDate|arriveoffDate|error message|
+-------+-----+-----------+-------------+-------------+
|  Delta|  Aus|   11/16/18|     08/06/19|             |
|  Delta|  Pak|   11/16/18|     08/06/19|             |
| Vistra|  New|   11/16/18|     15/06/19|   wrong date|
|  Delta|  Aus|   15/16/18|     08/06/19|   wrong date|
| JetAir|  Aus|   11/16/18|         null|             |
+-------+-----+-----------+-------------+-------------+
Answer 0 (score: 1)
I suggest using a user-defined function (UDF).
Here is an example:
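Test DataFrame (assumes a SparkSession named spark is in scope):
import spark.implicits._ // enables toDF on a Seq
val someDF = Seq(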
("11/16/18", "Aus"),
("15/16/18", "Pak"),
("11/16/18", "New")
).toDF("TakeoffDate", "Place")
UDF
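import org.apache.spark.sql.functions.udf
// Returns true when the input parses as a MM/dd/yy date, false otherwise.
// A null input is not handled here: LocalDate.parse throws a NullPointerException on null.
def isValidDate = udf((A: String) => {
  val DATE_TIME_FORMAT = "MM/dd/yy"
  try {
    java.time.LocalDate.parse(A, java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
    true
  } catch {
    case ex: java.time.format.DateTimeParseException =>
      false
  }
})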
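Note that I am using LocalDate rather than LocalDateTime: the strings contain only a date, and LocalDateTime.parse fails when the pattern supplies no time of day.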
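To make the LocalDate versus LocalDateTime point concrete, here is a minimal sketch in plain Scala (no Spark needed). The object and method names are made up for illustration; only the java.time calls come from the answer above.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper object, used only for this illustration.
object LocalDateVsLocalDateTime {
  private val fmt = DateTimeFormatter.ofPattern("MM/dd/yy")

  // The pattern supplies month, day and year, which is all LocalDate needs.
  def asDate(s: String): Try[LocalDate] = Try(LocalDate.parse(s, fmt))

  // LocalDateTime also needs a time of day, which this pattern never provides,
  // so this fails even for a well-formed date string.
  def asDateTime(s: String): Try[LocalDateTime] = Try(LocalDateTime.parse(s, fmt))

  def main(args: Array[String]): Unit = {
    println(asDate("11/16/18").isSuccess)     // valid date string
    println(asDateTime("11/16/18").isFailure) // same string, wrong target type
    println(asDate("15/16/18").isFailure)     // month 15 does not exist
  }
}
```

This is also why the code in the question rejects every row: it parses with LocalDateTime, and its pattern uses dashes while the data uses slashes.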
Usage:
someDF.withColumn("IsValidDate", isValidDate(someDF("TakeoffDate"))).show()
Result:
+-----------+-----+-----------+
|TakeoffDate|Place|IsValidDate|
+-----------+-----+-----------+
|   11/16/18|  Aus|       true|
|   15/16/18|  Pak|      false|
|   11/16/18|  New|       true|
+-----------+-----+-----------+
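The question asks for an "error message" column rather than a boolean flag. One way to get there from the UDF approach is to combine the check with when/otherwise over both date columns. The following is only a sketch under some assumptions: a local SparkSession, a hypothetical object name ErrorMessageSketch, the message text "wrong date" taken from the question, and null treated as not an error (which is how the JetAir row appears in the expected output).

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf, when}

// Hypothetical object name, used only for this illustration.
object ErrorMessageSketch {
  // Plain function so the rule can be checked without Spark.
  // Assumption: null is not reported as an error.
  def isBadDate(s: String): Boolean =
    if (s == null) false
    else try {
      LocalDate.parse(s, DateTimeFormatter.ofPattern("MM/dd/yy"))
      false
    } catch {
      case _: DateTimeParseException => true
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch out.
    val spark = SparkSession.builder().master("local[*]").appName("date-validation").getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val sample1 = Seq(
      ("Delta", "Aus", "11/16/18", "08/06/19"),
      ("Delta", "Pak", "11/16/18", "08/06/19"),
      ("Vistra", "New", "11/16/18", "15/06/19"),
      ("Delta", "Aus", "15/16/18", "08/06/19"),
      ("JetAir", "Aus", "11/16/18", null: String)
    ).toDF("AirName", "Place", "TakeoffDate", "arriveoffDate")

    val isBadDateUdf = udf(isBadDate _)

    // Flag the row if either date column fails to parse; otherwise leave the message empty.
    val withError = sample1.withColumn(
      "error message",
      when(isBadDateUdf(col("TakeoffDate")) || isBadDateUdf(col("arriveoffDate")), lit("wrong date"))
        .otherwise(lit(""))
    )

    withError.show()
    spark.stop()
  }
}
```

Keeping the rule in a plain function and wrapping it with udf afterwards makes it easy to unit-test. Note that the default formatter is lenient about day-of-month overflow (for example, 02/30/18 resolves to the last day of February), so a stricter resolver may be worth considering if such values should be rejected.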
Hope this helps.
Regards.