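I am trying to read messages from Kafka, process the data, and then write the result to Cassandra as if it were an RDD.
My trouble is with saving the data back to Cassandra.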
from __future__ import print_function
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark import SparkConf, SparkContext
appName = 'Kafka_Cassandra_Test'
kafkaBrokers = '1.2.3.4:9092'
topic = 'test'
cassandraHosts = '1,2,3'
sparkMaster = 'spark://mysparkmaster:7077'
if __name__ == "__main__":
    conf = SparkConf()
    conf.set('spark.cassandra.connection.host', cassandraHosts)
    sc = SparkContext(sparkMaster, appName, conf=conf)
    ssc = StreamingContext(sc, 1)
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": kafkaBrokers})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a+b)
    counts.saveToCassandra('coreglead_v2', 'wordcount')
    ssc.start()
    ssc.awaitTermination()
The error:
[root@gasweb2 ~]# spark-submit --jars /var/spark/lib/spark-streaming-kafka-assembly_2.10-1.6.0.jar --packages datastax:spark-cassandra-connector:1.5.0-RC1-s_2.11 /var/spark/scripts/kafka_cassandra.py
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/var/spark/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
datastax#spark-cassandra-connector added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found datastax#spark-cassandra-connector;1.5.0-RC1-s_2.11 in spark-packages
found org.apache.cassandra#cassandra-clientutil;2.2.2 in central
found com.datastax.cassandra#cassandra-driver-core;3.0.0-rc1 in central
found io.netty#netty-handler;4.0.33.Final in central
found io.netty#netty-buffer;4.0.33.Final in central
found io.netty#netty-common;4.0.33.Final in central
found io.netty#netty-transport;4.0.33.Final in central
found io.netty#netty-codec;4.0.33.Final in central
found io.dropwizard.metrics#metrics-core;3.1.2 in central
found org.slf4j#slf4j-api;1.7.7 in central
found org.apache.commons#commons-lang3;3.3.2 in central
found com.google.guava#guava;16.0.1 in central
found org.joda#joda-convert;1.2 in central
found joda-time#joda-time;2.3 in central
found com.twitter#jsr166e;1.1.0 in central
found org.scala-lang#scala-reflect;2.11.7 in central
:: resolution report :: resolve 647ms :: artifacts dl 15ms
:: modules in use:
com.datastax.cassandra#cassandra-driver-core;3.0.0-rc1 from central in [default]
com.google.guava#guava;16.0.1 from central in [default]
com.twitter#jsr166e;1.1.0 from central in [default]
datastax#spark-cassandra-connector;1.5.0-RC1-s_2.11 from spark-packages in [default]
io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
io.netty#netty-buffer;4.0.33.Final from central in [default]
io.netty#netty-codec;4.0.33.Final from central in [default]
io.netty#netty-common;4.0.33.Final from central in [default]
io.netty#netty-handler;4.0.33.Final from central in [default]
io.netty#netty-transport;4.0.33.Final from central in [default]
joda-time#joda-time;2.3 from central in [default]
org.apache.cassandra#cassandra-clientutil;2.2.2 from central in [default]
org.apache.commons#commons-lang3;3.3.2 from central in [default]
org.joda#joda-convert;1.2 from central in [default]
org.scala-lang#scala-reflect;2.11.7 from central in [default]
org.slf4j#slf4j-api;1.7.7 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 16 | 0 | 0 | 0 || 16 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 16 already retrieved (0kB/14ms)
16/02/15 16:26:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "/var/spark/scripts/kafka_cassandra.py", line 27, in <module>
counts.saveToCassandra('coreglead_v2', 'wordcount')
AttributeError: 'TransformedDStream' object has no attribute 'saveToCassandra'
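Searching turned up this GitHub issue, but it seems to concern a different library (one I can't use, because I'm on Cassandra 3.0 and it doesn't support that yet).
The goal is to build aggregate data from individual messages (the word count is just a test) and insert it into several tables.
I'm close to just using the Datastax Python Driver and writing the statements myself, but is there a better way to achieve this?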
Answer 0 (score: 3)
You are using Datastax's Spark Cassandra Connector, which does not support Python at the RDD / DStream level. Only DataFrames are supported. See the docs for details.
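As a rough illustration of the DataFrame route, each micro-batch can be converted to a DataFrame inside `foreachRDD` and written through the connector's `org.apache.spark.sql.cassandra` data source. This is a minimal sketch, not a tested setup: the column names `word` and `count` are assumptions about the `wordcount` table, and `cassandra_options` and `save_counts` are hypothetical helper names.

```python
from __future__ import print_function

KEYSPACE = 'coreglead_v2'  # keyspace name taken from the question
TABLE = 'wordcount'        # table name taken from the question


def cassandra_options(keyspace, table):
    """Build the options passed to the connector's DataFrame source."""
    return {'keyspace': keyspace, 'table': table}


def save_counts(rdd):
    """Write one micro-batch of (word, count) pairs via the DataFrame API."""
    if rdd.isEmpty():
        return  # nothing arrived in this batch interval
    # Imported lazily so the module can be loaded without a Spark runtime.
    from pyspark.sql import SQLContext, Row
    sql_context = SQLContext.getOrCreate(rdd.context)
    # ASSUMPTION: the target table has columns named "word" and "count".
    rows = rdd.map(lambda kv: Row(word=kv[0], count=kv[1]))
    (sql_context.createDataFrame(rows)
        .write
        .format('org.apache.spark.sql.cassandra')
        .options(**cassandra_options(KEYSPACE, TABLE))
        .mode('append')
        .save())


# In the question's script this would stand in for the failing call:
# counts.foreachRDD(save_counts)
```

Because the write happens once per batch interval, a one-second interval means one small DataFrame write per second; whether that is acceptable depends on the workload.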
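I have written a wrapper around said connector: PySpark Cassandra. It is not feature complete compared with the Datastax connector, but a lot is there. Also, if performance matters, it may be worth investigating the performance hit.
Finally, Spark ships a Python example that uses the CqlInput / OutputFormat from Hadoop MapReduce. In my opinion it is not a very developer-friendly option, but it is there.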
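Answer 1 (score: 0)
Looking at your code and reading through the problem description: it appears you are not using any Cassandra connector. Spark has no out-of-the-box Cassandra support, so the RDD and DStream data types have no saveToCassandra method. You need to import an external Spark-Cassandra connector that extends the RDD and DStream types to support Cassandra integration.
That is why you get the error: Python cannot find any saveToCassandra function on the DStream type, because none currently exists.
You will need to get the DataStax connector, or some other connector, to extend the DStream type with saveToCassandra.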
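If no connector fits, the fallback mentioned in the question (the DataStax Python driver with hand-written statements) can be driven from `foreachRDD` and `foreachPartition`, opening one connection per partition on the workers. The following is only a sketch under assumptions: `CASSANDRA_HOSTS` is a placeholder, the `word` and `count` columns are guessed (and the `INSERT` would need to become an `UPDATE` if `count` is a counter column), and `connect` and `write_partition` are hypothetical helper names. The `cassandra-driver` package would have to be installed on every worker.

```python
from __future__ import print_function

CASSANDRA_HOSTS = ['127.0.0.1']  # placeholder; use the real contact points
KEYSPACE = 'coreglead_v2'        # keyspace name taken from the question
# ASSUMPTION: the table has regular (non-counter) columns "word" and "count".
INSERT_CQL = 'INSERT INTO wordcount (word, count) VALUES (%s, %s)'


def connect(hosts, keyspace):
    """Open a cluster connection and a session bound to the keyspace."""
    # Imported lazily: the driver is only needed on the Spark workers.
    from cassandra.cluster import Cluster
    cluster = Cluster(list(hosts))
    return cluster, cluster.connect(keyspace)


def write_partition(pairs, connect=connect):
    """Insert every (word, count) pair of one partition, then disconnect."""
    cluster, session = connect(CASSANDRA_HOSTS, KEYSPACE)
    try:
        for word, count in pairs:
            session.execute(INSERT_CQL, (word, count))
    finally:
        cluster.shutdown()  # always release the connection


# In the question's script this would stand in for the failing call:
# counts.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))
```

Connecting per partition rather than per record keeps the number of connections bounded, and the `connect` parameter lets the function be exercised without a live cluster.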