I have two DataFrames, dfA and dfB. dfA is created by reading a CSV file with spark.read.load, and dfB is created by reading a database table with spark.read.format("com.databricks.spark.redshift").
I then used registerTempTable to create two tables, tblA and tblB, from dfA and dfB respectively.
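For reference, here is a minimal sketch of that setup. The file path, Redshift JDBC URL, table name, and S3 tempdir are placeholders, not the actual values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("redshift-distinct-repro").getOrCreate()

    // dfA: read from a CSV file (path is a placeholder)
    val dfA = spark.read
      .format("csv")
      .option("header", "true")
      .load("/path/to/fileA.csv")

    // dfB: read from a Redshift table via the spark-redshift connector
    // (url, dbtable, and tempdir are placeholders; the connector UNLOADs the table to S3 first)
    val dfB = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
      .option("dbtable", "schema.some_table")
      .option("tempdir", "s3a://my-bucket/tmp/")
      .load()

    // Register both DataFrames as temporary tables so they can be queried with SQL
    dfA.registerTempTable("tblA")
    dfB.registerTempTable("tblB")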
Both

    SELECT count(*) FROM tblA;

and

    SELECT count(*) FROM tblB;

work. Likewise,

    SELECT count(*) FROM (SELECT DISTINCT * FROM tblA) AS temp;

works, but

    SELECT count(*) FROM (SELECT DISTINCT * FROM tblB) AS temp;

does not.
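The queries above are issued through the SparkSession; a hypothetical reproduction (only the last statement throws):

    spark.sql("SELECT count(*) FROM tblA").show()                                   // OK
    spark.sql("SELECT count(*) FROM tblB").show()                                   // OK
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblA) AS temp").show()  // OK
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblB) AS temp").show()  // throws SQLException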
This looks like a Redshift driver issue. The error message when tblB is referenced is:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.sql.SQLException: Exception thrown in awaitResult:
    at com.databricks.spark.redshift.JDBCWrapper.com$databricks$spark$redshift$JDBCWrapper$$executeInterruptibly(RedshiftJDBCWrapper.scala:150)
    at com.databricks.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:124)
    at com.databricks.spark.redshift.RedshiftRelation.getRDDFromS3(RedshiftRelation.scala:175)
    at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:157)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:475)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:332)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)