How do I use the Spark Riak connector with pyspark?

Asked: 2016-11-25 07:01:35

Tags: apache-spark pyspark riak

I followed the instructions at https://github.com/basho/spark-riak-connector, running Spark 2.0.2 (built for Hadoop 2.7).

Things I have tried:

1) pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0

2) pyspark --driver-class-path /path/to/spark-riak-connector_2.11-1.6.0-uber.jar

3) Adding spark.driver.extraClassPath /path/to/jars/* to the master's spark-defaults.conf (see the sketch after this list)

4) Trying older versions of the connector (1.5.0 and 1.5.1)
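For reference, the spark-defaults.conf entry from attempt #3 looked roughly like this (the paths are placeholders; the matching executor property is an assumption I added, not something taken from the connector docs):

# $SPARK_HOME/conf/spark-defaults.conf
spark.driver.extraClassPath   /path/to/jars/*
spark.executor.extraClassPath /path/to/jars/*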

I can verify on the master's web UI that the Riak jars are loaded in pyspark's application environment. I also double-checked that Spark's Scala version is 2.11.
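In case it helps, this is how I checked the Scala build (spark-submit's version banner prints it):

spark-submit --version
# the banner includes a line like "Using Scala version 2.11.x"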

But no matter what I do, I cannot import pyspark_riak:

>>> import pyspark_riak
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named pyspark_riak

How can I fix this?

When trying option #1, the jars are loaded and the resolution report looks fine:

:: modules in use:
    com.basho.riak#riak-client;2.0.7 from central in [default]
    com.basho.riak#spark-riak-connector_2.11;1.6.0 from central in [default]
    com.fasterxml.jackson.core#jackson-annotations;2.8.0 from central in [default]
    com.fasterxml.jackson.core#jackson-core;2.8.0 from central in [default]
    com.fasterxml.jackson.core#jackson-databind;2.8.0 from central in [default]
    com.fasterxml.jackson.datatype#jackson-datatype-joda;2.4.4 from central in [default]
    com.fasterxml.jackson.module#jackson-module-scala_2.11;2.4.4 from central in [default]
    com.google.guava#guava;14.0.1 from central in [default]
    joda-time#joda-time;2.2 from central in [default]
    org.erlang.otp#jinterface;1.6.1 from central in [default]
    org.scala-lang#scala-reflect;2.11.2 from central in [default]
    :: evicted modules:
    com.fasterxml.jackson.core#jackson-core;2.4.4 by [com.fasterxml.jackson.core#jackson-core;2.8.0] in [default]
    com.fasterxml.jackson.core#jackson-annotations;2.4.4 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default]
    com.fasterxml.jackson.core#jackson-databind;2.4.4 by [com.fasterxml.jackson.core#jackson-databind;2.8.0] in [default]
    com.fasterxml.jackson.core#jackson-annotations;2.4.0 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   15  |   11  |   11  |   4   ||   11  |   11  |
    ---------------------------------------------------------------------

If I print sys.path I can also see /tmp/spark-b2396e0a-f329-4066-b3b1-4e8c21944a66/userFiles-7e423d94-5aa2-4fe4-935a-e06ab2d423ae/com.basho.riak_spark-riak-connector_2.11-1.6.0.jar (which I confirmed exists).
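For example, a quick filter inside the pyspark shell (the temp directory will of course differ per session):

>>> import sys
>>> [p for p in sys.path if 'riak' in p.lower()]   # should list the connector jar path shown above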

1 answer:

Answer 0 (score: 2):

The spark-riak-connector from the repository does not come with pyspark support, but you can build it yourself and attach it to pyspark:

git clone https://github.com/basho/spark-riak-connector.git
cd spark-riak-connector/
python connector/python/setup.py bdist_egg # creates egg file inside connector/python/dist/
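Note that setuptools creates dist/ relative to the working directory, so if the egg does not show up under connector/python/dist/, a variant that builds from inside that directory (same repository checkout assumed) is:

cd connector/python
python setup.py bdist_egg   # egg lands in ./dist/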

Then add the newly created egg to the Python path:

pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0
>>> import sys
>>> sys.path.append('connector/python/dist/pyspark_riak-1.0.0-py2.7.egg')
>>> import pyspark_riak
>>> 
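Alternatively, instead of appending to sys.path inside the shell, you could hand the egg to pyspark up front with the standard --py-files option (same egg path as in the snippet above):

pyspark --py-files connector/python/dist/pyspark_riak-1.0.0-py2.7.egg --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0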

But be careful when using the spark-riak-connector with Spark 2.0.2: as far as I can tell, the latest package version was tested against Spark 1.6.2, and the API may not work as expected.