How do I provide a table name in a spark-jdbc application to read data from an RDBMS?

Date: 2018-12-18 14:07:05

Tags: apache-spark greenplum

I am trying to use Spark to read a table that lives in a Greenplum database, as shown below.


This is the code; when I run it with spark-submit, an exception is thrown:

val execQuery = s"select ${allColumns}, 0 as ${flagCol} from schema.table where period_year=2017 and period_num=12"
val yearDF = spark.read.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
                       .option("url", connectionUrl)
                       .option("dbtable", s"(${execQuery}) as year2016")
                       .option("user", devUserName)
                       .option("password", devPassword)
                       .option("partitionColumn", "header_id")
                       .option("lowerBound", 16550)
                       .option("upperBound", 1152921481695656862L)
                       .option("numPartitions", 450)
                       .load()

Exception in thread "main" org.postgresql.util.PSQLException: ERROR: relation "public.(select je_header_id,source_system_name,je_line_num,last_update" does not exist Position: 15
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2310)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2023)
    at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:217)
    at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:421)
    at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:318)
    at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:281)
    at com.zaxxer.hikari.pool.ProxyStatement.executeQuery(ProxyStatement.java:111)
    at com.zaxxer.hikari.pool.HikariProxyStatement.executeQuery(HikariProxyStatement.java)
    at io.pivotal.greenplum.spark.jdbc.Jdbc$.resolveTable(Jdbc.scala:301)
    at io.pivotal.greenplum.spark.GreenplumRelationProvider.createRelation(GreenplumRelationProvider.scala:29)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:309)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
    at com.partition.source.YearPartition$.prepareFinalDF$1(YearPartition.scala:141)
    at com.partition.source.YearPartition$.main(YearPartition.scala:164)
    at com.partition.source.YearPartition.main(YearPartition.scala)

In execQuery I can see that the format name and table name are formed correctly. Yet when I submit the code, the connector reports relation "public.(select je_header_id,source_system_name,..." not found, i.e. it treats public as the schema name and the query text itself as the table name. I don't understand why.

Could someone tell me what I am doing wrong here and how to fix it?

1 Answer:

Answer 0 (score: 1)

If you use Spark JDBC, you can wrap the query in parentheses, give it an alias, and pass the whole string as the dbtable option. If the Pivotal Greenplum connector behaves like any other JDBC source, this should work.

val query = """
  (select a.id, b.id, a.name from a left outer join b on a.id=b.id
    limit 100) foo
"""

val df = sqlContext.read.format("jdbc").
  option("url", "jdbc:mysql://localhost:3306/local_content").
  option("driver", "com.mysql.jdbc.Driver").
  option("useUnicode", "true").
  option("continueBatchOnError","true").
  option("useSSL", "false").
  option("user", "root").
  option("password", "").
  option("dbtable",query).
  load()
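Applied to the query from the question, the same pattern would look like the sketch below. This assumes the GreenplumRelationProvider accepts a parenthesised subquery with an alias in dbtable the way the plain Spark JDBC source does; connectionUrl, devUserName, devPassword, allColumns, and flagCol are the variables from the question.

```scala
// Sketch: fold the alias into the dbtable string instead of
// building "(query) as alias" separately.
val execQuery =
  s"""(select ${allColumns}, 0 as ${flagCol}
      from schema.table
      where period_year=2017 and period_num=12) as year2016"""

val yearDF = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .option("url", connectionUrl)
  .option("user", devUserName)
  .option("password", devPassword)
  .option("dbtable", execQuery)        // alias is already part of the string
  .option("partitionColumn", "header_id")
  .option("lowerBound", 16550)
  .option("upperBound", 1152921481695656862L)
  .option("numPartitions", 450)
  .load()
```

Note that the error message in the question suggests the Greenplum connector resolves dbtable as a literal relation name. If that is the case, no subquery will work with that connector, and the fallback is the standard JDBC source (format("jdbc") with the PostgreSQL driver), where subqueries in dbtable are supported as shown in the answer above.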