How to validate the date columns of a DataFrame

Time: 2019-06-04 12:55:08

Tags: scala apache-spark

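I have a DataFrame in which several columns contain date data. I want to validate those columns and, where a date is invalid, update the DataFrame with an error message column. I have tried this but could not get it to work. The code I tried is below.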

val DATE_TIME_FORMAT = "MM-dd-yy"

  def validateDf(row: Row): Boolean = try {
    // assume row.getString(2) will give the date-time string
    java.time.LocalDateTime.parse(row.getString(2), java.time.format.DateTimeFormatter.ofPattern(DATE_TIME_FORMAT))
    true
  } catch {
    case ex: java.time.format.DateTimeParseException => {
      // Handle exception if you want
      false
    }
  }

val validDf = sample1.filter(validateDf(_))
val inValidDf = sample1.except(validDf)

Expected data frame:

+-------+-----+-----------+-------------+-------------+
|AirName|Place|TakeoffDate|arriveoffDate|error message|
+-------+-----+-----------+-------------+-------------+
|  Delta|  Aus|   11/16/18|     08/06/19|             |
|  Delta|  Pak|   11/16/18|     08/06/19|             |
| Vistra|  New|   11/16/18|     15/06/19|wrong date   |
|  Delta|  Aus|   15/16/18|     08/06/19|wrong date   |
| JetAir|  Aus|   11/16/18|         null|             |
+-------+-----+-----------+-------------+-------------+

1 answer:

Answer 0: (score: 1)

I suggest using a user-defined function (UDF).

Here is an example:

Test DataFrame

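// toDF needs the SparkSession implicits (assumes the session is named spark, as in spark-shell)
import spark.implicits._

val someDF = Seq(
  ("11/16/18", "Aus"),
  ("15/16/18", "Pak"),
  ("11/16/18", "New")
).toDF("TakeoffDate", "Place")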

UDF

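import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import org.apache.spark.sql.functions.udf

// UDF that returns true when the string is a valid date in MM/dd/yy format
val isValidDate = udf((date: String) => {

  val DATE_FORMAT = "MM/dd/yy"

  try {
    // parse throws DateTimeParseException for values such as "15/16/18"
    LocalDate.parse(date, DateTimeFormatter.ofPattern(DATE_FORMAT))
    true
  } catch {
    case _: DateTimeParseException => false
  }
})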

Note that I use LocalDate rather than LocalDateTime, because the values contain a date but no time.
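To illustrate why the choice of class matters, here is a minimal sketch in plain Scala (no Spark needed). It parses the same string with the same date-only pattern into both types. The object and method names (`LocalDateVsLocalDateTime`, `asDate`, `asDateTime`) are hypothetical and only serve the illustration.

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

object LocalDateVsLocalDateTime {
  // Date-only pattern, matching values such as "11/16/18"
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("MM/dd/yy")

  // A date-only pattern supplies every field that LocalDate needs
  def asDate(s: String): Try[LocalDate] =
    Try(LocalDate.parse(s, formatter))

  // LocalDateTime also needs time fields, which this pattern never provides,
  // so the parse fails even for a well-formed date
  def asDateTime(s: String): Try[LocalDateTime] =
    Try(LocalDateTime.parse(s, formatter))

  def main(args: Array[String]): Unit = {
    println(asDate("11/16/18"))     // Success(2018-11-16)
    println(asDateTime("11/16/18")) // a Failure wrapping DateTimeParseException
  }
}
```

This would explain why a check built on LocalDateTime with a date-only pattern marks every row as invalid, regardless of the value.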

Usage:

someDF.withColumn("IsValidDate", isValidDate(someDF("TakeoffDate"))).show()

Result:

+-----------+-----+-----------+
|TakeoffDate|Place|IsValidDate|
+-----------+-----+-----------+
|   11/16/18|  Aus|       true|
|   15/16/18|  Pak|      false|
|   11/16/18|  New|       true|
+-----------+-----+-----------+
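The question asks for an error message column rather than a Boolean flag. The same idea can probably be adapted by returning a string from the UDF. The following is only a sketch under stated assumptions: the names `DateErrorColumn`, `errorFor`, `errorMessage` and `flights` are hypothetical, the session is created locally for the demo, and treating a null date as "no error" is an assumption based on the JetAir row of the expected data frame.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object DateErrorColumn {
  // Plain function so the rule can be checked without a Spark session.
  // Assumption: null means "no date supplied" and is not reported as an error.
  def errorFor(date: String): String = {
    if (date == null) ""
    else if (Try(LocalDate.parse(date, DateTimeFormatter.ofPattern("MM/dd/yy"))).isSuccess) ""
    else "wrong date"
  }

  def main(args: Array[String]): Unit = {
    // Local session for the demo only
    val spark = SparkSession.builder().master("local[*]").appName("date-check").getOrCreate()
    import spark.implicits._

    // Small sample in the spirit of the question's data
    val flights = Seq[(String, String)](
      ("Delta", "11/16/18"),
      ("Delta", "15/16/18"),
      ("JetAir", null)
    ).toDF("AirName", "TakeoffDate")

    // Wrap the plain function as a UDF and add the message column
    val errorMessage = udf(errorFor _)
    flights.withColumn("error message", errorMessage(col("TakeoffDate"))).show()

    spark.stop()
  }
}
```

Checking for null before parsing matters here because a null string would otherwise raise a NullPointerException inside the UDF, which the DateTimeParseException handler does not catch. To validate several date columns, the same UDF could be applied to each column in turn.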

Hope this helps.

Regards.