How does Spark load data from Phoenix?

Date: 2018-07-16 10:25:12

Tags: apache-spark phoenix

If I execute the following code:

spark.read.format("org.apache.phoenix.spark") \
    .option("table", "data_table") \
    .option("zkUrl", zkUrl) \
    .load().createOrReplaceTempView("table")

spark.sql("select * from table where date>='2018-07-02 00:00:00' and date<'2018-07-04 00:00:00'").createTempView("c")
spark.sql("select * from c").explain(True)

then I get the following explain output:

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `c`

== Analyzed Logical Plan ==
DATE: timestamp, ID: string, SESSIONID: string, IP: string, NAME: string, BYTES_SENT: int, DELTA: int, W_ID: string
Project [DATE#48, ID#49, SESSIONID#50, IP#51, NAME#52, BYTES_SENT#53, DELTA#54, W_ID#55]
+- SubqueryAlias c
   +- Project [DATE#48, ID#49, SESSIONID#50, IP#51, NAME#52, BYTES_SENT#53, DELTA#54, W_ID#55]
      +- Filter ((cast(date#48 as string) >= 2018-07-02 00:00:00) && (cast(date#48 as string) < 2018-07-04 00:00:00))
         +- SubqueryAlias table
            +- Relation[DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false)

== Optimized Logical Plan ==
Filter ((isnotnull(date#48) && (cast(date#48 as string) >= 2018-07-02 00:00:00)) && (cast(date#48 as string) < 2018-07-04 00:00:00))
+- Relation[DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false)

== Physical Plan ==
*(1) Filter (((cast(DATE#48 as string) >= 2018-07-02 00:00:00) && (cast(DATE#48 as string) < 2018-07-04 00:00:00)) && isnotnull(DATE#48))
+- *(1) Scan PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false) [DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PushedFilters: [IsNotNull(DATE)], ReadSchema: struct<DATE:timestamp,ID:string,SESSIONID:string,IP:string,NAME:string,BYTES_SENT:int,DELTA:int,W...

In this explain output, does PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false) mean that it is sending a request to the Phoenix table? If so, that is understandable, because we are selecting all the data from the table, so the query has to be sent to Phoenix. What confuses me is the following code:

spark.sql("select * from c where date>='2018-07-03 00:00:00' and date<'2018-07-04 00:00:00'").createTempView("d")

spark.sql("select * from d").explain(True)

which gives me the following explain output:

== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation `d`

== Analyzed Logical Plan ==
DATE: timestamp, ID: string, SESSIONID: string, IP: string, NAME: string, BYTES_SENT: int, DELTA: int, W_ID: string
Project [DATE#48, ID#49, SESSIONID#50, IP#51, NAME#52, BYTES_SENT#53, DELTA#54, W_ID#55]
+- SubqueryAlias d
   +- Project [DATE#48, ID#49, SESSIONID#50, IP#51, NAME#52, BYTES_SENT#53, DELTA#54, W_ID#55]
      +- Filter ((cast(date#48 as string) >= 2018-07-03 00:00:00) && (cast(date#48 as string) < 2018-07-04 00:00:00))
         +- SubqueryAlias c
            +- Project [DATE#48, ID#49, SESSIONID#50, IP#51, NAME#52, BYTES_SENT#53, DELTA#54, W_ID#55]
               +- Filter ((cast(date#48 as string) >= 2018-07-02 00:00:00) && (cast(date#48 as string) < 2018-07-04 00:00:00))
                  +- SubqueryAlias table
                     +- Relation[DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false)

== Optimized Logical Plan ==
Filter (((isnotnull(date#48) && (cast(date#48 as string) >= 2018-07-02 00:00:00)) && (cast(date#48 as string) < 2018-07-04 00:00:00)) && (cast(date#48 as string) >= 2018-07-03 00:00:00))
+- Relation[DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false)

== Physical Plan ==
*(1) Filter ((((cast(DATE#48 as string) >= 2018-07-02 00:00:00) && (cast(DATE#48 as string) < 2018-07-04 00:00:00)) && (cast(DATE#48 as string) >= 2018-07-03 00:00:00)) && isnotnull(DATE#48))
+- *(1) Scan PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false) [DATE#48,ID#49,SESSIONID#50,IP#51,NAME#52,BYTES_SENT#53,DELTA#54,W_ID#55] PushedFilters: [IsNotNull(DATE)], ReadSchema: struct<DATE:timestamp,ID:string,SESSIONID:string,IP:string,NAME:string,BYTES_SENT:int,DELTA:int,W...

Here again we can see PhoenixRelation(data_table,10.10.5.20,10.10.5.21,10.10.5.22,10.10.5.23:2181,false). Is it trying to connect to the Phoenix table again? Didn't the first query already load the data? I thought it would load the data and store it in a DataFrame, which, if I am not mistaken, is an in-memory table; so if I am only fetching data from another DataFrame, why is Spark trying to connect to Phoenix?

My actual problem is that I need to run some computation on a DataFrame inside a loop of about 1,000 iterations. If I load the data from a CSV file, the whole operation takes about 2 seconds, but when I fetch the data from the Phoenix table it takes much longer. I did call explain() inside the loop, and on every iteration I got the same PhoenixRelation as above. I must be doing something wrong here, but I cannot figure out what. Is this the expected behavior of spark-phoenix?
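For what it's worth, the only workaround I can think of is to cache the DataFrame before entering the loop, so that Phoenix is scanned only once. Below is a minimal sketch of that idea, assuming cache() followed by an action such as count() is enough to materialize the data in memory; the loop body is a simplified placeholder for my real computation, and I have not verified that this actually fixes the slowdown:

# Reuses the same spark session and zkUrl variable as above.
df = spark.read.format("org.apache.phoenix.spark") \
    .option("table", "data_table") \
    .option("zkUrl", zkUrl) \
    .load()

# cache() only marks the DataFrame for caching; the count() action is
# what actually pulls the data from Phoenix and materializes it.
df.cache()
df.count()

df.createOrReplaceTempView("table")

# Each iteration should now read the cached data instead of issuing
# a new scan against Phoenix.
for i in range(1000):
    spark.sql("select * from table where date >= '2018-07-02 00:00:00'").count()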

0 Answers:

No answers.