I have two DataFrames, dfA and dfB. dfA is created by reading a CSV file with spark.read.load, and dfB is created by reading a database table with spark.read.format("com.databricks.spark.redshift").
I then used registerTempTable to create two tables, tblA and tblB, from dfA and dfB respectively.
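For reference, here is a minimal sketch of that setup. The file path, Redshift JDBC URL, table name, and S3 tempdir are placeholders, not the actual values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("redshift-distinct-repro").getOrCreate()

    // dfA: read from a CSV file (path is a placeholder)
    val dfA = spark.read
      .format("csv")
      .option("header", "true")
      .load("/path/to/fileA.csv")

    // dfB: read from a Redshift table via the spark-redshift connector
    // (url, dbtable, and tempdir are placeholders; the connector UNLOADs the table to S3 first)
    val dfB = spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/db?user=USER&password=PASS")
      .option("dbtable", "schema.some_table")
      .option("tempdir", "s3a://my-bucket/tmp/")
      .load()

    // Register both DataFrames as temporary tables so they can be queried with SQL
    dfA.registerTempTable("tblA")
    dfB.registerTempTable("tblB")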
Both

    SELECT count(*) FROM tblA;

and

    SELECT count(*) FROM tblB;

work. Likewise,

    SELECT count(*) FROM (SELECT DISTINCT * FROM tblA) AS temp;

works, but

    SELECT count(*) FROM (SELECT DISTINCT * FROM tblB) AS temp;

does not.
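The queries above are issued through the SparkSession; a hypothetical reproduction (only the last statement throws):

    spark.sql("SELECT count(*) FROM tblA").show()                                   // OK
    spark.sql("SELECT count(*) FROM tblB").show()                                   // OK
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblA) AS temp").show()  // OK
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblB) AS temp").show()  // throws SQLException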
This looks like a Redshift driver issue. The error message when tblB is referenced is:
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.sql.SQLException: Exception thrown in awaitResult:
    at com.databricks.spark.redshift.JDBCWrapper.com$databricks$spark$redshift$JDBCWrapper$$executeInterruptibly(RedshiftJDBCWrapper.scala:150)
    at com.databricks.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:124)
    at com.databricks.spark.redshift.RedshiftRelation.getRDDFromS3(RedshiftRelation.scala:175)
    at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:157)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:475)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
    at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:332)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
    at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)