Question

I am using Spark to read a table from MapR DB, but the timestamp column is inferred as InvalidType. There is also no option to set the schema when reading data from MapR DB.

root
 |-- Name: string (nullable = true)
 |-- dt: struct (nullable = true)
 |    |-- InvalidType: string (nullable = true)

I tried casting the column to a timestamp, but got the following exception.

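import org.apache.spark.sql.types.TimestampType
import com.mapr.db.spark.sql._   // adds loadFromMapRDB to SparkSession
import spark.implicits._

val df = spark.loadFromMapRDB("path")
val df1 = df.withColumn("dt1", $"dt"("InvalidType").cast(TimestampType))
  .drop("dt")
df1.show(5, false)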

com.mapr.db.spark.exceptions.SchemaMappingException: Schema cannot be inferred for the column {dt}
    at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertField(MapRSqlUtils.scala:250)
    at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertObject(MapRSqlUtils.scala:64)
    at com.mapr.db.spark.sql.utils.MapRSqlUtils$.convertRootField(MapRSqlUtils.scala:48)
    at com.mapr.db.spark.sql.utils.MapRSqlUtils$$anonfun$documentsToRow$1.apply(MapRSqlUtils.scala:34)
    at com.mapr.db.spark.sql.utils.MapRSqlUtils$$anonfun$documentsToRow$1.apply(MapRSqlUtils.scala:33)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Any help would be appreciated.

Answer 1

If you know the table's schema, you can create your own case class that defines it and then load the table using that case class.

See this link: Loading Data from MapR Database as an Apache Spark Dataset
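The case-class approach can be sketched as follows. This is a minimal sketch, assuming the typed `loadFromMapRDB[T]` form described in the linked MapR documentation is available in your connector version. The case class `Record`, its `_id` field, the application name, and the table path `"path"` are placeholders, and mapping `dt` to `java.sql.Timestamp` assumes the column really stores timestamp values.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark.sql._ // MapR DB connector implicits (assumed to be on the classpath)

// Hypothetical case class describing the table; adjust the fields to the real schema.
case class Record(_id: String, Name: String, dt: Timestamp)

object LoadWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("maprdb-typed-load").getOrCreate()
    import spark.implicits._

    // The type parameter supplies the schema, so it does not have to be inferred from the data.
    val ds = spark.loadFromMapRDB[Record]("path").as[Record]
    ds.printSchema()
    ds.show(5, false)

    spark.stop()
  }
}
```

If the documents hold values of different types in `dt`, the load may still fail, so it is worth checking the data in the table as well.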

Also check whether that particular column has a valid schema in the MapR DB table.
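One way to check the column is to count the value types that actually occur in `dt`. This is a rough sketch, assuming the connector's RDD API (`sc.loadFromMapRDB`) and that each document exposes the underlying `org.ojai.Document` through `getDoc`; verify both against your connector version. The application name and `"path"` are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import com.mapr.db.spark._ // adds loadFromMapRDB to SparkContext (assumed connector API)

object InspectDtTypes {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("inspect-dt-types").getOrCreate()

    // Tally the OJAI type of the dt field across all documents in the table.
    val typeCounts = spark.sparkContext
      .loadFromMapRDB("path")
      .map { doc =>
        // getValue returns null when the field is absent from a document
        Option(doc.getDoc.getValue("dt")).map(_.getType.toString).getOrElse("MISSING")
      }
      .countByValue()

    typeCounts.foreach { case (valueType, count) => println(s"$valueType: $count") }

    spark.stop()
  }
}
```

More than one type in the output (for example, both a string type and a timestamp type) may explain why the column is reported as InvalidType.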
