I am new to Python 2.6.x, PySpark, Spark 1.6, and HBase 1.1, and I am trying to read data from a table using the Apache Phoenix Spark plugin.
Reading the data:
dfRows = sparkConfig.getSqlContext().read \
    .format('org.apache.phoenix.spark') \
    .option('table', 'TableA') \
    .option('zkUrl', 'xxx:2181:/hbase-secure') \
    .load()
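For reference, a self-contained version of that read is sketched below; this assumes a plain SQLContext (my sparkConfig wrapper just hands back the SQLContext), and the app name is only a placeholder:

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Minimal setup; in my actual job this is done by the sparkConfig wrapper
sc = SparkContext(appName='PhoenixReadTest')
sqlContext = SQLContext(sc)

# Read TableA through the Phoenix Spark data source (Spark 1.6 API)
dfRows = sqlContext.read \
    .format('org.apache.phoenix.spark') \
    .option('table', 'TableA') \
    .option('zkUrl', 'xxx:2181:/hbase-secure') \
    .load()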
I also run the Python file with spark-submit, using the following arguments and jars:
spark-submit --master yarn-client --executor-memory 24G --driver-memory 20G --num-executors 10 --queue aQueue --jars /usr/hdp/2.6.1.40-4/phoenix/lib/phoenix-core-4.7.0.2.6.1.40-4.jar,/usr/hdp/current/phoenix-client/lib/hbase-client.jar,/usr/hdp/current/phoenix-client/phoenix-client.jar,/usr/hdp/2.6.1.40-4/hive2/lib/twill-zookeeper-0.6.0-incubating.jar,/usr/hdp/2.6.1.40-4/hive2/lib/twill-discovery-api-0.6.0-incubating.jar,/usr/hdp/2.6.1.40-4/hive2/lib/hive-hbase-handler.jar Test.py
This works fine, but when I call dfRows.first(), it throws the following exception:
Caused by: java.lang.IllegalStateException: unread block data
at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2449)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1385)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:207)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
... 1 more