Question

Using Apache Spark 1.6.0.

I have two databases, Phoenix and Postgres, holding the same number of records, and I have a performance problem when loading data from the Postgres table.

To create a DataFrame from the Phoenix table, I use the phoenixTableAsDataFrame method of org.apache.phoenix.spark.SparkSqlContextFunctions:

import org.apache.phoenix.spark._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

implicit val sc: SparkContext = new SparkContext(new SparkConf().setAppName("App"))
sc.hadoopConfiguration.set("hbase.zookeeper.quorum", "jdbc:phoenix:node01,node02,node03:2181")

implicit val sqlc = new SQLContext(sc)

// TableName, TableColumns and configuration are defined elsewhere.
def findAllPH: DataFrame = {
    val predicate: Option[String] = Some(" operationTime > TO_TIMESTAMP( '30/01/2017 00:00:00' ,'dd/MM/yyyy HH:mm:ss') ")
    sqlc.phoenixTableAsDataFrame(
      TableName, TableColumns, predicate, conf = configuration
    ).select("custumer_id")
}

To create a DataFrame from Postgres:

    def findAllPS(): DataFrame = {
      sqlc.read
        .format("jdbc")
        .options(Map(
          "url" -> "jdbc:postgresql://server_db:5432/db_name",
          "driver" -> "org.postgresql.Driver",
          "dbtable" -> "(select id,name,surname from custumer_table) t",
          "user" -> "user",
          "password" -> "password"))
        .load()
        .select("id", "name", "surname")
    }

When I call findAllPH.where("custumer_id = '100000'").show(1), it takes a few seconds (thanks to the DAGScheduler).

When I call findAllPS.where("id = '100000'").show(1), it takes several minutes, because Spark loads all the records before filtering by id (it seems the DAGScheduler is not involved).

So if I run a DataFrame SQL join, findAllPH join findAllPS on id = custumer_id, it takes a very long time, whereas a self-join of findAllPH on custumer_id takes only a short time.

Is there a way to make Postgres behave like Phoenix?
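
One way to see where each filter is evaluated is to print the query plans instead of running the queries. The following is only a minimal sketch: PlanCheckSketch and showPlan are hypothetical names, and it relies on nothing beyond DataFrame.where and DataFrame.explain, which are both available in Spark 1.6. How much detail the plan shows about filters handed to the data source may vary between Spark versions.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper, not part of Spark or Phoenix.
object PlanCheckSketch {
  // Prints the logical and physical plans of the filtered DataFrame.
  // explain(true) only prints the plans; it does not execute the query.
  def showPlan(df: DataFrame, condition: String): Unit =
    df.where(condition).explain(true)
}
```

For example, PlanCheckSketch.showPlan(findAllPS(), "id = '100000'") and PlanCheckSketch.showPlan(findAllPH, "custumer_id = '100000'") can be compared side by side to check whether the Postgres filter ends up in the scan or in a separate Spark-side step.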

A first solution to the problem is to retrieve all the ids:

val ids = findAllPH.rdd.map(_.get(0).toString).collect()

and then query Postgres like this:

"dbtable" -> ("(select id,name,surname from custumer_table where id in (" + ids.mkString("'", "', '", "'") + ")) t")

This is faster, but not as fast as expected, because the collect is very expensive.
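
The workaround above can be sketched as a small helper that turns the collected ids into one or more "dbtable" subqueries. This is an illustration under assumptions, not a tested solution: InClauseSketch, dbtableFor and batched are hypothetical names, the ids are quoted as string literals (matching the '100000' filter used earlier), and splitting into batches is an added idea to keep each IN list from growing too large.

```scala
// Hypothetical helper for building the Postgres subquery; pure Scala, no Spark needed.
object InClauseSketch {
  // Builds the "dbtable" value for one batch of ids.
  def dbtableFor(ids: Seq[String]): String = {
    require(ids.nonEmpty, "ids must not be empty")
    // Quote each id and double any embedded single quote.
    val quoted = ids.map(id => "'" + id.replace("'", "''") + "'")
    val inList = quoted.mkString(", ")
    s"(select id,name,surname from custumer_table where id in ($inList)) t"
  }

  // Removes duplicate ids and returns one subquery per batch of at most batchSize ids.
  def batched(ids: Seq[String], batchSize: Int): Vector[String] = {
    require(batchSize > 0, "batchSize must be positive")
    ids.distinct.grouped(batchSize).map(dbtableFor).toVector
  }
}
```

Each returned string could be passed as the "dbtable" option of a JDBC read like the one in findAllPS. The collect on the Phoenix side remains the expensive step, so this only tidies up the query construction.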

Loading a Spark DataFrame from a Postgres table

0 Answers: