How do I save a Spark pair RDD to HDFS as files?

Date: 2017-01-09 17:08:39

Tags: hdfs apache-kafka spark-streaming rdd apache-kafka-connect

Hi, I have created a Kafka topic with 3 partitions and 2 replicas. I am trying to publish the messages/records from Kafka to Spark Streaming (for some processing), and then store the data in HDFS. I tried to save the pair RDD as a text file, but it does not work.
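For reference, a topic with that layout would typically be created with the kafka-topics tool; the topic name filebeat and the ZooKeeper address localhost:2181 are taken from the log output below:

bin/kafka-topics.sh --create --zookeeper localhost:2181 \
    --replication-factor 2 --partitions 3 --topic filebeat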

This code does not work:


JavaPairInputDStream<String, String> directKafkaStream = KafkaUtils
        .createDirectStream(ssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams,
                topics);

directKafkaStream.foreachRDD(rdd -> {
    if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(path);
    }
});
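
A note on the snippet above: saveAsTextFile is called with the same path for every micro-batch, so from the second batch onward the save will fail because the output directory already exists. A common workaround is to write each batch to its own directory, for example by suffixing the path with the batch time. A minimal sketch, reusing the stream and path from the code above (the per-batch naming scheme is just one option):

directKafkaStream.foreachRDD((rdd, time) -> {
    if (!rdd.isEmpty()) {
        // One directory per micro-batch, so saveAsTextFile never
        // hits an already-existing output path.
        rdd.saveAsTextFile(path + "-" + time.milliseconds());
    }
});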

And here is my pom.xml:

[pom.xml contents not shown]

Console output:
17/01/09 17:25:39 INFO KafkaRDD: Computing topic filebeat, partition 1 offsets 20 -> 32
17/01/09 17:25:39 INFO VerifiableProperties: Verifying properties
17/01/09 17:25:39 INFO VerifiableProperties: Property group.id is overridden to 
17/01/09 17:25:39 INFO VerifiableProperties: Property zookeeper.connect is overridden to localhost:2181
17/01/09 17:25:39 INFO KafkaRDD: Computing topic filebeat, partition 0 offsets 22 -> 34
17/01/09 17:25:39 INFO VerifiableProperties: Verifying properties
17/01/09 17:25:39 INFO VerifiableProperties: Property group.id is overridden to 
17/01/09 17:25:39 INFO VerifiableProperties: Property zookeeper.connect is overridden to localhost:2181
17/01/09 17:25:40 INFO JobScheduler: Added jobs for time 1483979140000 ms
17/01/09 17:25:40 ERROR Utils: Aborting task
java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream
    at kafka.message.ByteBufferMessageSet$.decompress(ByteBufferMessageSet.scala:65)
    at kafka.message.ByteBufferMessageSet$$anon$1.makeNextOuter(ByteBufferMessageSet.scala:179)
    at kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:192)
    at kafka.message.ByteBufferMessageSet$$anon$1.makeNext(ByteBufferMessageSet.scala:146)
    at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
    at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
    at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:764)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:211)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1203)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.message.KafkaLZ4BlockOutputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 22 more
17/01/09 17:25:40 ERROR Utils: Aborting task
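
The root cause is visible in the stack trace: kafka.message.ByteBufferMessageSet (from the Kafka 0.8 core jar used by the spark-streaming-kafka 0-8 connector) tries to load org.apache.kafka.common.message.KafkaLZ4BlockOutputStream, a class that exists under that package only in the kafka-clients 0.8.2.x line; from 0.9 onward it was moved elsewhere. A newer kafka-clients jar on the classpath therefore breaks the connector. A pom.xml sketch with the versions kept in line (the Spark and Scala versions here are assumptions, since the original pom.xml is not shown):

<!-- Assumed versions; adjust to match your Spark build. -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.0.2</version>
</dependency>
<dependency>
    <!-- Pin kafka-clients to the 0.8.2 line, which still ships
         org.apache.kafka.common.message.KafkaLZ4BlockOutputStream. -->
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>0.8.2.2</version>
</dependency>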

0 answers:

No answers yet.