PySpark HBase / Phoenix Integration

Date: 2015-09-15 11:23:47

Tags: apache-spark pyspark phoenix

I need to read Phoenix data into PySpark.

Edit: I am using the Spark HBase converters.

Here is the code snippet:

port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf)

Traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
    at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)

Any help is much appreciated.

Thanks! /Tina

2 Answers:

Answer 0 (score: 1)

I would suggest using the spark phoenix plugin. Please see the details about the phoenix-spark plugin here.
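
For comparison, Phoenix tables can also be read through Spark's generic JDBC data source, but the phoenix-spark plugin is preferred because it can parallelize work across the underlying Phoenix splits. A rough sketch of the JDBC route, assuming the Phoenix client jar is already on the classpath (host, port, and table name below are placeholders):

    # Sketch only: generic JDBC read; does not parallelize across Phoenix splits
    df = sqlContext.read \
        .format("jdbc") \
        .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver") \
        .option("url", "jdbc:phoenix:zookeeperquorum:2181") \
        .option("dbtable", "TableName") \
        .load()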

Environment: tested with AWS EMR 5.10, PySpark.

Here are the steps:

  1. Create a table in Phoenix (see https://phoenix.apache.org/language/). Open the Phoenix shell:

    /usr/lib/phoenix/bin/sqlline.py

    DROP TABLE IF EXISTS TableName;

    CREATE TABLE TableName (DOMAIN VARCHAR PRIMARY KEY);

    UPSERT INTO TableName (DOMAIN) VALUES ('foo');
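
  A quick sanity check before moving on (a minimal query, assuming the statements above succeeded):

    SELECT * FROM TableName;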

  2. Download the spark phoenix plugin jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3. You need the phoenix-<version>-HBase-<version>-client.jar; I used phoenix-4.11.0-HBase-1.3-client.jar to match my Phoenix and HBase versions.

  3. From your hadoop home directory, set the following variable:

    phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar

  4. Launch the PySpark shell and add the dependency to the driver and executor classpaths:

    pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}
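
  The same flags also work for a non-interactive job via spark-submit (a sketch; your_script.py is a placeholder for your own application):

    spark-submit --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars} your_script.py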

  5. Create the ZooKeeper URL, replacing it with your cluster's zookeeper quorum; you can check it in hbase-site.xml:

    # ZooKeeper quorum of the cluster (see hbase-site.xml)
    emrMaster = "ZooKeeper URL"

    # Load the Phoenix table as a DataFrame
    df = sqlContext.read \
        .format("org.apache.phoenix.spark") \
        .option("table", "TableName") \
        .option("zkUrl", emrMaster) \
        .load()

    df.show()
    df.columns
    df.printSchema()

    # Replace 'foo' with 'foo1' in the DOMAIN column
    df1 = df.replace(['foo'], ['foo1'], 'DOMAIN')
    df1.show()

    # Write the modified DataFrame back to Phoenix
    df1.write \
      .format("org.apache.phoenix.spark") \
      .mode("overwrite") \
      .option("table", "TableName") \
      .option("zkUrl", emrMaster) \
      .save()
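
  As a follow-up sketch under the same setup (the phoenix-spark plugin is able to push column projections and filters down to Phoenix), you can re-read just the updated row:

    # Re-read the table; the projection and filter below are pushed to Phoenix
    df2 = sqlContext.read \
        .format("org.apache.phoenix.spark") \
        .option("table", "TableName") \
        .option("zkUrl", emrMaster) \
        .load()
    df2.select("DOMAIN").filter(df2.DOMAIN == 'foo1').show()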
    

Answer 1 (score: 0)