Casting Spark DataFrame columns for Kudu compatibility

Asked: 2019-05-15 19:27:07

Tags: scala apache-spark impala apache-kudu

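(I'm new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table with the same structure, via Kudu. I get an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data types of a Spark DataFrame so that they are compatible with Kudu?

This is meant to be a 1-to-1 copy of the data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark + Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).

I would like to specify that "column #1, named SOME_COL (a NUMBER in Oracle), should be mapped to LongType, which is supported in Kudu".

How can I do that?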

// This works
val df: DataFrame = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)

// This does not work  
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala

// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark  data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu   data type: SOME_COL BIGINT

1 answer:

Answer 0 (score: 1)

Apparently, we can specify a custom schema when reading from a JDBC data source:

connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
  .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)

That worked. I was able to specify a customSchema like this:

col1 Long, col2 Timestamp, col3 Double, col4 String
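For reference, here is a minimal sketch of how a string like that might be assembled and placed into the connection properties, the same way the PostgreSQL example above passes it. The `columnTypes` mapping and its column names are placeholders, not the real table, and the type names are written in upper case only as a matter of style.

```scala
import java.util.Properties

// Hypothetical column -> Spark SQL type mapping; replace with the real table's columns.
val columnTypes: Seq[(String, String)] = Seq(
  "col1" -> "LONG",
  "col2" -> "TIMESTAMP",
  "col3" -> "DOUBLE",
  "col4" -> "STRING"
)

// customSchema expects comma-separated "name TYPE" pairs.
val customSchema: String =
  columnTypes.map { case (name, sqlType) => s"$name $sqlType" }.mkString(", ")

// This Properties object is what gets passed as the last argument to spark.read.jdbc(...).
val props = new Properties()
props.setProperty("customSchema", customSchema)
```

Any user name, password, or other driver settings would go into the same `props` object.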

With that in place, the following works:

import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
  .as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
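If customSchema is not an option, casting the columns after the read should get to the same place. This is only a rough sketch and not what I ran: it assumes that every DecimalType column with scale 0 really holds values that fit in a Long (larger values would not survive the cast faithfully), and `castIntegralDecimalsToLong` is a name made up for this example.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, LongType}

// Hypothetical helper: replaces every scale-0 decimal column with a LongType
// column of the same name and leaves all other columns untouched.
def castIntegralDecimalsToLong(df: DataFrame): DataFrame = {
  val projected = df.schema.fields.map { field =>
    field.dataType match {
      // e.g. DecimalType(38,0), which is how the Oracle NUMBER column arrived here
      case d: DecimalType if d.scale == 0 =>
        col(field.name).cast(LongType).as(field.name)
      case _ =>
        col(field.name)
    }
  }
  df.select(projected: _*)
}
```

The result could then be passed to the same call as before, for example `kuduContext.insertRows(castIntegralDecimalsToLong(df).toDF(colNamesLower: _*), "impala::schema.table_name")`.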