将cassandra数据读入pyspark DataFrame

时间:2016-05-18 14:57:08

标签: cassandra pyspark

我按如下方式启动了pyspark:

[idf@node1 python]$ pyspark --conf spark.cassandra.connection.host=10.0.0.60
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
16/05/18 10:52:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/05/18 10:52:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.

当我尝试做一些简单的事情时,我会得到一堆没有帮助的错误:

In [1]: import pyspark_cassandra

In [2]: user = sc.cassandraTable("tickdata", "timeseries").toDF()
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-2-59f7356f4bac> in <module>()
----> 1 user = sc.cassandraTable("tickdata", "timeseries").toDF()

/home/idf/anaconda2/lib/python2.7/site-packages/pyspark_cassandra-0.3.5-py2.7.egg/pyspark_cassandra/context.pyc in cassandraTable(self, *args, **kwargs)
     28     def cassandraTable(self, *args, **kwargs):
     29         """Returns a CassandraTableScanRDD for the given keyspace and table"""
---> 30         return CassandraTableScanRDD(self, *args, **kwargs)

/home/idf/anaconda2/lib/python2.7/site-packages/pyspark_cassandra-0.3.5-py2.7.egg/pyspark_cassandra/rdd.pyc in __init__(self, ctx, keyspace, table, row_format, read_conf, **read_conf_kwargs)
    233         read_conf = as_java_object(ctx._gateway, self.read_conf.settings())
    234 
--> 235         self.crdd = self._helper \
    236             .cassandraTable(
    237                 ctx._jsc,

/home/idf/anaconda2/lib/python2.7/site-packages/pyspark_cassandra-0.3.5-py2.7.egg/pyspark_cassandra/rdd.pyc in _helper(self)
    130     @property
    131     def _helper(self):
--> 132         return helper(self.ctx)
    133 
    134 

/home/idf/anaconda2/lib/python2.7/site-packages/pyspark_cassandra-0.3.5-py2.7.egg/pyspark_cassandra/util.pyc in helper(ctx)
     91 
     92     if not _helper:
---> 93         _helper = load_class(ctx, "pyspark_cassandra.PythonHelper").newInstance()
     94 
     95     return _helper

/home/idf/anaconda2/lib/python2.7/site-packages/pyspark_cassandra-0.3.5-py2.7.egg/pyspark_cassandra/util.pyc in load_class(ctx, name)
     83 def load_class(ctx, name):
     84     return ctx._jvm.java.lang.Thread.currentThread().getContextClassLoader() \
---> 85             .loadClass(name)
     86 
     87 _helper = None

/opt/spark-latest/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/opt/spark-latest/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/opt/spark-latest/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling o20.loadClass.
: java.lang.ClassNotFoundException: pyspark_cassandra.PythonHelper
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:745)


In [3]: 

修改1

这样做会让我更进一步

[idf@node1 python]$ pyspark --packages TargetHolding/pyspark-cassandra:0.3.5 --conf spark.cassandra.connection.host=10.0.0.60
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Ivy Default Cache set to: /home/idf/.ivy2/cache
The jars for the packages stored in: /home/idf/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
TargetHolding#pyspark-cassandra added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
    confs: [default]
    found TargetHolding#pyspark-cassandra;0.3.5 in spark-packages
    found com.datastax.spark#spark-cassandra-connector-java_2.10;1.6.0-M1 in list
    found com.datastax.spark#spark-cassandra-connector_2.10;1.6.0-M1 in list
    found org.apache.cassandra#cassandra-clientutil;3.0.2 in list
    found com.datastax.cassandra#cassandra-driver-core;3.0.0 in list
    found io.netty#netty-handler;4.0.33.Final in central
    found io.netty#netty-buffer;4.0.33.Final in central
    found io.netty#netty-common;4.0.33.Final in central
    found io.netty#netty-transport;4.0.33.Final in central
    found io.netty#netty-codec;4.0.33.Final in central
    found io.dropwizard.metrics#metrics-core;3.1.2 in list
    found org.slf4j#slf4j-api;1.7.7 in list
    found org.apache.commons#commons-lang3;3.3.2 in list
    found com.google.guava#guava;16.0.1 in list
    found org.joda#joda-convert;1.2 in list
    found joda-time#joda-time;2.3 in list
    found com.twitter#jsr166e;1.1.0 in list
    found org.scala-lang#scala-reflect;2.10.5 in list
:: resolution report :: resolve 902ms :: artifacts dl 18ms
    :: modules in use:
    TargetHolding#pyspark-cassandra;0.3.5 from spark-packages in [default]
    com.datastax.cassandra#cassandra-driver-core;3.0.0 from list in [default]
    com.datastax.spark#spark-cassandra-connector-java_2.10;1.6.0-M1 from list in [default]
    com.datastax.spark#spark-cassandra-connector_2.10;1.6.0-M1 from list in [default]
    com.google.guava#guava;16.0.1 from list in [default]
    com.twitter#jsr166e;1.1.0 from list in [default]
    io.dropwizard.metrics#metrics-core;3.1.2 from list in [default]
    io.netty#netty-buffer;4.0.33.Final from central in [default]
    io.netty#netty-codec;4.0.33.Final from central in [default]
    io.netty#netty-common;4.0.33.Final from central in [default]
    io.netty#netty-handler;4.0.33.Final from central in [default]
    io.netty#netty-transport;4.0.33.Final from central in [default]
    joda-time#joda-time;2.3 from list in [default]
    org.apache.cassandra#cassandra-clientutil;3.0.2 from list in [default]
    org.apache.commons#commons-lang3;3.3.2 from list in [default]
    org.joda#joda-convert;1.2 from list in [default]
    org.scala-lang#scala-reflect;2.10.5 from list in [default]
    org.slf4j#slf4j-api;1.7.7 from list in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   18  |   0   |   0   |   0   ||   18  |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
    confs: [default]
    0 artifacts copied, 18 already retrieved (0kB/22ms)
16/05/18 12:06:12 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: import pyspark_cassandra

In [2]: rdd = sc.cassandraTable("tickdata", "timeseries")
16/05/18 12:08:36 WARN ClosureCleaner: Expected a closure; got pyspark_cassandra.ToRow$
16/05/18 12:08:36 WARN ClosureCleaner: Expected a closure; got pyspark_util.BatchPickler

In [3]: 

1 个答案:

答案 0 :(得分:1)

pyspark-cassandra需要Python和Scala代码才能运行。由于你的问题不清楚你如何包括我的猜测,你只是在PYTHONPATH添加了Python代码。

如果您使用Spark与Scala 2.10(Spark&lt; = 1.6的默认构建),您可以将pyspark-cassandra--packages一起使用:

pyspark --packages TargetHolding:pyspark-cassandra:0.3.5 \
        --conf spark.cassandra.connection.host=10.0.0.60

否则您必须自己使用--jars--driver-class-path--py-files参数构建它(这三个都是必需的)。