Apache Phoenix for Spark not working

Date: 2017-04-26 17:27:58

Tags: apache-spark phoenix

I cannot connect to Phoenix (4.10) from Spark (2.1.0), following the "Load as a DataFrame using the Data Source API" example on the Phoenix website. I am using the latest Phoenix (4.10) and HBase 1.2.5. I can create a table in HBase through Phoenix (the sqlline client). The error returned in Spark is as follows:

scala> val df = sqlContext.load("org.apache.phoenix.spark",Map("table" -> "test", "zkUrl" -> "localhost:2181"))

warning: there was one deprecation warning; re-run with -deprecation for details
java.sql.SQLException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hbase.TableExistsException): SYSTEM.MUTEX
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2465)
at org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:2382)
at org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:76)
at org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:2382)
at org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:255)
at org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnection(PhoenixEmbeddedDriver.java:149)
at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:221)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:98)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
at org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
at org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:292)
at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:118)
at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:60)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:389)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:965)
... 50 elided
Caused by: org.apache.hadoop.ipc.RemoteException: SYSTEM.MUTEX
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.prepareCreate(CreateTableProcedure.java:285)
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:106)
at org.apache.hadoop.hbase.master.procedure.CreateTableProcedure.executeFromState(CreateTableProcedure.java:58)
at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:119)
at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:498)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1147)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:942)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execLoop(ProcedureExecutor.java:895)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$400(ProcedureExecutor.java:77)
at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$2.run(ProcedureExecutor.java:497)

UPDATE 1: If the SYSTEM.MUTEX table is deleted via HBase, everything works fine.
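
For reference, the deletion can be done from the HBase shell, assuming the default setup where the Phoenix system table appears in HBase under the name 'SYSTEM.MUTEX':

disable 'SYSTEM.MUTEX'
drop 'SYSTEM.MUTEX'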

UPDATE 2: After the SYSTEM.MUTEX table is deleted, it is recreated as soon as a connection to Phoenix is established via sqlContext.load(). This means that when loading another table, or even when reloading the same table, the same exception is thrown while Phoenix tries to recreate the SYSTEM.MUTEX table.

UPDATE 3: It seems that if you start with no SYSTEM.MUTEX table in HBase, everything works fine within the same Spark session, i.e. you can connect to any number of tables. However, if another Spark session is initialized, the same exception is thrown from that second Spark context.

Following the suggestion in https://issues.apache.org/jira/browse/PHOENIX-3814 (including the hbase-client jar in the Spark classpath), the same exception is still thrown.
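
For reference, the classpath change was applied along these lines (the jar paths and the exact hbase-client version below are placeholders, not the literal command):

./bin/spark-shell --master local[4] --jars /path/to/phoenix-4.10.0-HBase-1.2-client.jar,/path/to/hbase-client-1.2.5.jar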

UPDATE 4: I ended up making a custom build of the Phoenix project. The fix was to change line 2427 of the class org.apache.phoenix.query.ConnectionQueryServicesImpl (phoenix-core) to if (!admin.tableExists(SYSTEM_MUTEX_NAME_BYTES)) createSysMutexTable(admin);. In addition, the DataFrame loading example given at https://phoenix.apache.org/phoenix_spark.html is incorrect, because it relies on the deprecated/removed save method of the DataFrame class; the write method must be used instead. See the example below:

./bin/spark-shell --master local[4] --deploy-mode client --jars path_to_to/phoenix-4.10.1-HBase-1.2-SNAPSHOT-client.jar

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
import org.apache.spark.sql.SaveMode

val sqlContext = new SQLContext(sc)

// Read the input table from Phoenix into a DataFrame.
val df = sqlContext.load("org.apache.phoenix.spark",Map("table" -> "name_of_input_table_in_phoenix", "zkUrl" -> "localhost:2181"))

// Write to the output table via the DataFrameWriter API rather than the deprecated save method.
df.write.format("org.apache.phoenix.spark").mode(SaveMode.Overwrite).options(Map("table" -> "name_of_output_table_in_phoenix","zkUrl" -> "localhost:2181")).save()
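
As an aside, the read side can also avoid the deprecated sqlContext.load call by going through the DataFrameReader API; a minimal equivalent of the load above:

val df = sqlContext.read.format("org.apache.phoenix.spark").options(Map("table" -> "name_of_input_table_in_phoenix", "zkUrl" -> "localhost:2181")).load()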

Note that the output table must already exist in Phoenix and have the correct schema. Also note that I am using a custom build, hence the SNAPSHOT in the client jar name.
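
For illustration, the output table could be created beforehand via sqlline with a DDL along these lines (table and column names are placeholders, not from my actual setup):

CREATE TABLE name_of_output_table_in_phoenix (ID BIGINT NOT NULL PRIMARY KEY, COL1 VARCHAR);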

1 Answer:

Answer 0: (score: 1)

The current 4.10 release appears to have this bug. At initialization (when load is called on the SQLContext), the Phoenix client tries to create the SYSTEM.MUTEX table (the createSysMutexTable method in org.apache.phoenix.query.ConnectionQueryServicesImpl, phoenix-core). If this table already exists, however, HBase throws a TableExistsException. Although the createSysMutexTable method catches the TableAlreadyExists exception, that is not the exception HBase throws here, and the exception from HBase arrives wrapped (in a RemoteException). As a result, the exception goes unhandled. The solution is to update the code so that createSysMutexTable is only called when the MUTEX table does not exist. For the complete solution and example code, see UPDATE 4.
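
For reference, the guard described above (the fix from UPDATE 4) amounts to the following sketch of the patched call site inside ConnectionQueryServicesImpl, where admin, SYSTEM_MUTEX_NAME_BYTES, and createSysMutexTable are Phoenix internals:

// Only attempt to create SYSTEM.MUTEX when it does not already exist,
// so HBase never gets a chance to throw the wrapped TableExistsException.
if (!admin.tableExists(SYSTEM_MUTEX_NAME_BYTES)) {
    createSysMutexTable(admin);
}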