I have the following Spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)  # 2-second batch interval
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    #print rdd
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run it, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
This is the shell script I run it with:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py
Comparing Spark Streaming with Kafka, the code above is missing a line like this:

kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})

which is what I actually use with Kafka. For Cassandra, however, I can't find anything like createStream in the documentation. How do I start the streaming between Spark Streaming and Cassandra?
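For context, "No output operations registered" means the StreamingContext was started without any output operation (such as pprint or foreachRDD) attached to a DStream. Below is a minimal sketch of a Kafka job that does register one, reusing the createStream parameters from above; the pprint() call is only illustrative:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(conf=SparkConf().setAppName("Kafka example"))
ssc = StreamingContext(sc, 2)

# createStream returns a DStream of (key, message) pairs
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', "name", {'topic': 1})

# pprint() registers an output operation, so the context has something to execute
kafkaStream.pprint()

ssc.start()
ssc.awaitTermination()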
Versions:
Answer 0 (score: 0):
To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing an RDD created from the Cassandra table as the input. This causes the RDD to be materialized on every DStream interval.
Be warned that large tables, or tables whose size keeps growing, will negatively impact the performance of your streaming job.
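ConstantInputDStream is part of the Scala/Java streaming API and is not exposed in PySpark, so the sketch below is my own approximation rather than the answer's code: it feeds the Cassandra RDD through queueStream with a default RDD (which re-emits that RDD on every batch interval) and registers an output operation, assuming the RDD returned by cassandraTable behaves like a regular PySpark RDD:

from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().setAppName("PySpark Cassandra Test")
sc = CassandraSparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

# RDD backed by the Cassandra table (no collect() here; keep it as an RDD)
users_rdd = sc.cassandraTable("keyspace2", "users")

# queueStream with a default RDD: the queued RDD feeds the first batch,
# and the default RDD is re-used for every batch after that
users_stream = ssc.queueStream([users_rdd], default=users_rdd)

# An output operation must be registered, otherwise the context refuses to start
users_stream.foreachRDD(lambda rdd: print(rdd.count()))

ssc.start()
ssc.awaitTermination()

As the answer warns, the table-backed RDD is re-materialized on every interval, so this pattern only makes sense for small, slowly changing tables.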