Spark: converting a string to TimestampType

Asked: 2018-05-02 19:28:34

Tags: postgresql datetime apache-spark jdbc

I have a dataframe that I want to insert into PostgreSQL from Spark. In Spark, the DateTimestamp column is in string format; in PostgreSQL, the column is a timestamp without time zone.

The insert fails on the datetime column. I did try changing the data type, but the insert still errors out, and I can't figure out why the cast isn't working. If I paste the same insert statement into PgAdmin and run it, it runs fine.

import java.text.SimpleDateFormat
import java.util.Calendar

object EtlHelper {

  // Return the current time as a "yyyy-MM-dd HH:mm:ss" formatted string
  def getCurrentTime(): String = {
    val now = Calendar.getInstance().getTime()
    val hourFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    hourFormat.format(now)
  }
}
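For reference, a minimal usage sketch (the printed value is illustrative) — note that the helper returns a formatted String, not a java.sql.Timestamp:

val ts = EtlHelper.getCurrentTime() // e.g. "2018-05-02 12:15:54"
println(ts)                         // a plain String, which Spark will infer as StringType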

In another file:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataTypes

object CreateDimensions {

  def createDimCompany(spark: SparkSession, location: String, propsLocation: String): Unit = {
    import spark.implicits._

    val dimCompanyStartTime = EtlHelper.getCurrentTime()
    val dimCompanyEndTime = EtlHelper.getCurrentTime()
    val prevDimCompanyId = 2
    val numRdd = 27

    val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId, numRdd, dimCompanyStartTime, dimCompanyEndTime)))
      .toDF("audit_tbl_name", "audit_tbl_id", "audit_no_rows", "audit_tbl_start_date", "audit_tbl_end_date")

    // Attempt to cast the string columns to timestamp
    AuditDF.withColumn("audit_tbl_start_date", AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
    AuditDF.withColumn("audit_tbl_end_date", AuditDF.col("audit_tbl_end_date").cast(DataTypes.TimestampType))

    AuditDF.printSchema()
  }
}

root
 |-- audit_tbl_name: string (nullable = true)
 |-- audit_tbl_id: long (nullable = false)
 |-- audit_no_rows: long (nullable = false)
 |-- audit_tbl_start_date: string (nullable = true)
 |-- audit_tbl_end_date: string (nullable = true)

This is the error I get:

INSERT INTO etl.audit_master ("audit_tbl_name","audit_tbl_id","audit_no_rows","audit_tbl_start_date","audit_tbl_end_date") VALUES ('dim_company',27,2,'2018-05-02 12:15:54','2018-05-02 12:15:59') was aborted: ERROR: column "audit_tbl_start_date" is of type timestamp without time zone but expression is of type character varying
  Hint: You will need to rewrite or cast the expression.

Any help is appreciated.

Thanks

2 Answers:

Answer 0 (score: 2)

AuditDF.printSchema() is printing the schema of the original AuditDF dataframe, because you never saved the .withColumn transformations by assigning them. Dataframes are immutable objects: a transformation returns a new dataframe rather than changing the one it was called on, so you always need an assignment to keep the transformations you have applied.

So the correct way is to assign the result in order to keep the changes:

val transformedDF = AuditDF.withColumn("audit_tbl_start_date", AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
  .withColumn("audit_tbl_end_date", AuditDF.col("audit_tbl_end_date").cast("timestamp"))

transformedDF.printSchema()

and you will see the change:

root
 |-- audit_tbl_name: string (nullable = true)
 |-- audit_tbl_id: integer (nullable = false)
 |-- audit_no_rows: integer (nullable = false)
 |-- audit_tbl_start_date: timestamp (nullable = true)
 |-- audit_tbl_end_date: timestamp (nullable = true)

.cast(DataTypes.TimestampType) and .cast("timestamp") are the same thing.
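A minimal sketch, using the question's dataframe, showing that the two spellings resolve to the same Spark type:

import org.apache.spark.sql.types.DataTypes

val a = AuditDF.withColumn("ts", AuditDF.col("audit_tbl_start_date").cast(DataTypes.TimestampType))
val b = AuditDF.withColumn("ts", AuditDF.col("audit_tbl_start_date").cast("timestamp"))

// Both columns end up as TimestampType, so the two StructFields are equal
a.schema("ts") == b.schema("ts") // true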

Answer 1 (score: 1)

The root of the problem is what @Ramesh mentioned, i.e. that you didn't assign the changes to AuditDF to a new value (val). Note that both the dataframe and the value you assigned it to are immutable (auditDF is defined as a val, so it can't be reassigned either).

The other thing is that you don't need to reinvent the wheel with EtlHelper; Spark's built-in functions give you a timestamp of the current time:

import org.apache.spark.sql.functions._

val AuditDF = spark.createDataset(Array(("dim_company", prevDimCompanyId, numRdd)))
  .toDF("audit_tbl_name", "audit_tbl_id", "audit_no_rows")
  .withColumn("audit_tbl_start_date", current_timestamp())
  .withColumn("audit_tbl_end_date", current_timestamp())