Spark DataFrames are not exactly the same

Date: 2018-05-04 01:26:03

Tags: apache-spark pyspark apache-spark-sql amazon-redshift databricks

I have two DataFrames, dfA and dfB. dfA is created by reading a CSV file with spark.read.load. dfB is created by reading from a database table with spark.read.format("com.databricks.spark.redshift").
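For context, a minimal sketch of how the two DataFrames are created; the CSV path, read options, Redshift JDBC URL, table name, and S3 tempdir below are placeholders rather than the actual values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # dfA: read from a CSV file (path and options are placeholders)
    dfA = spark.read.load("/path/to/fileA.csv", format="csv",
                          header=True, inferSchema=True)

    # dfB: read from a Redshift table via the spark-redshift connector
    # (JDBC URL, table name, and S3 tempdir are placeholders)
    dfB = (spark.read.format("com.databricks.spark.redshift")
           .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
           .option("dbtable", "<table>")
           .option("tempdir", "s3a://<bucket>/tmp/")
           .load())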

I then created two tables, tblA and tblB, from dfA and dfB respectively using registerTempTable.
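Registering the temp tables is just the following (a sketch; registerTempTable is the pre-Spark-2.0 API, createOrReplaceTempView is the newer equivalent):

    # Expose both DataFrames to Spark SQL under the names used below
    dfA.registerTempTable("tblA")
    dfB.registerTempTable("tblB")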

Both SELECT count(*) FROM tblA; and SELECT count(*) FROM tblB; work. However,

SELECT count(*) from (select distinct * from tblA) as temp;

works, but

SELECT count(*) from (select distinct * from tblB) as temp;

does not.
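For completeness, this is roughly how the queries are issued through Spark SQL (a sketch assuming the same SparkSession as above):

    # Plain counts succeed against both temp tables
    spark.sql("SELECT count(*) FROM tblA").show()
    spark.sql("SELECT count(*) FROM tblB").show()

    # count over DISTINCT works for the CSV-backed tblA ...
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblA) AS temp").show()

    # ... but the same query against the Redshift-backed tblB fails
    # with the exception shown below
    spark.sql("SELECT count(*) FROM (SELECT DISTINCT * FROM tblB) AS temp").show()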

This looks like a Redshift driver issue. The error message when tblB is referenced is:

    com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: java.sql.SQLException: Exception thrown in awaitResult:
        at com.databricks.spark.redshift.JDBCWrapper.com$databricks$spark$redshift$JDBCWrapper$$executeInterruptibly(RedshiftJDBCWrapper.scala:150)
        at com.databricks.spark.redshift.JDBCWrapper.executeInterruptibly(RedshiftJDBCWrapper.scala:124)
        at com.databricks.spark.redshift.RedshiftRelation.getRDDFromS3(RedshiftRelation.scala:175)
        at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:157)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:475)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
        at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:332)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
        at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
        at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)

0 Answers:

No answers yet.