How do I enable streaming from Cassandra to Spark?

Time: 2016-01-26 12:47:37

Tags: apache-spark cassandra pyspark spark-streaming datastax

I have the following Spark job:

from __future__ import print_function

import os
import sys
import time
from random import random
from operator import add
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SQLContext, Row
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":

    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)

    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop()

When I run it, it gives me the following error:

ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

This is the shell script I run it with:

./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py

Comparing this with Spark Streaming from Kafka, where I actually use kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1}), I can't see anything like this for Cassandra in the docs. How do I start streaming between Spark Streaming and Cassandra?
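
As a sanity check, the error reproduces without Cassandra at all: a StreamingContext refuses to start unless at least one output operation (pprint, foreachRDD, or one of the save operations) is registered on some DStream. A minimal sketch, with queueStream used purely as a stand-in source:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="OutputOpDemo")
ssc = StreamingContext(sc, 2)

# queueStream is used here only as a trivial stand-in source.
stream = ssc.queueStream([sc.parallelize([1, 2, 3])])

# This output operation is what makes the context startable; without
# it, start() fails with "No output operations registered".
stream.pprint()

ssc.start()
ssc.awaitTermination(timeout=10)
ssc.stop()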


1 answer:

Answer 0 (score: 0):

To create a DStream from a Cassandra table, you can use a ConstantInputDStream, providing it the RDD created from the Cassandra table as input. The RDD is then materialized on each DStream batch interval.
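
ConstantInputDStream is part of the Scala/Java streaming API and is not exposed in PySpark, which the question uses. Below is a minimal PySpark sketch of the same idea, under the assumption that pyspark_cassandra's RDD wrapper (taken from the question) is usable here: queueStream with an empty queue and a default RDD returns that RDD on every batch interval once the queue drains, approximating ConstantInputDStream. The foreachRDD call is the output operation the original job was missing:

from __future__ import print_function

from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().setAppName("Cassandra DStream sketch")
sc = CassandraSparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

# Lazy RDD backed by the Cassandra table; note: no collect() here.
users = sc.cassandraTable("keyspace2", "users")

# An empty queue plus a default RDD: each batch interval re-delivers
# `users`, similar to ConstantInputDStream in the Scala API.
dstream = ssc.queueStream([], default=users)

def show_count(rdd):
    # Materializes the table once per interval, as described above.
    print("user rows this batch:", rdd.count())

# The required output operation.
dstream.foreachRDD(show_count)

ssc.start()
ssc.awaitTermination()

Submitted the same way as the question's job (with the TargetHolding:pyspark-cassandra package), this should print a count every two seconds.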

Be aware that large tables, or tables that keep growing in size, will negatively impact the performance of the streaming job.

See also: Reading from Cassandra using Spark Streaming for an example.