Question

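Does anyone know how to set up

streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)

so that it actually consumes binary files?

Where can I find all the InputFormat classes? The documentation gives no link for them. I imagine the ValueClass is somehow related to the InputFormatClass.
In the non-streaming version, using the binaryFiles method, I can get a byte array for each file. Is there a way to get the same with Spark Streaming? If not, where can I find those details, meaning the supported input formats and the value class each one produces? Finally, can one pick any KeyClass, or are all of these elements linked?

I would appreciate it if someone could clarify how this method is used.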

EDIT1

I tried the following:

val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")

However, the compiler complains as follows:

[error] /xxxxxxxxx/src/main/scala/EstimatorStreamingApp.scala:14: type arguments [org.apache.hadoop.io.BytesWritable,org.apache.hadoop.io.BytesWritable,org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat] conform to the bounds of none of the overloaded alternatives of
[error]  value fileStream: [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean, conf: org.apache.hadoop.conf.Configuration)(implicit evidence$10: scala.reflect.ClassTag[K], implicit evidence$11: scala.reflect.ClassTag[V], implicit evidence$12: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String, filter: org.apache.hadoop.fs.Path => Boolean, newFilesOnly: Boolean)(implicit evidence$7: scala.reflect.ClassTag[K], implicit evidence$8: scala.reflect.ClassTag[V], implicit evidence$9: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)] <and> [K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]](directory: String)(implicit evidence$4: scala.reflect.ClassTag[K], implicit evidence$5: scala.reflect.ClassTag[V], implicit evidence$6: scala.reflect.ClassTag[F])org.apache.spark.streaming.dstream.InputDStream[(K, V)]
[error]   val bfiles = ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat]("/xxxxxxxxx/Casalini_streamed")

What am I doing wrong?

Answer 1

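Follow the link to read about all the Hadoop input formats.

I found a well-documented answer about the sequence file format here.

You are facing a compilation issue because of an import mismatch: Hadoop mapred vs. mapreduce.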

E.g.

Java

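// sc is a JavaStreamingContext. SequenceFileAsBinaryInputFormat produces
// BytesWritable keys and values, so both type arguments are BytesWritable.
JavaPairInputDStream<BytesWritable, BytesWritable> dstream =
        sc.fileStream("/somepath",
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.io.BytesWritable.class,
        org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat.class);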

I have not tried it in Scala, but it should be something similar:

// In Scala, StreamingContext.fileStream takes the classes as type parameters,
// not as classOf[...] arguments.
val dstream = sc.fileStream[
        org.apache.hadoop.io.BytesWritable, org.apache.hadoop.io.BytesWritable,
        org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat]("/somepath")
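For the part of the question about getting byte arrays, here is a minimal sketch of how the stream could be mapped to plain Array[Byte] pairs. It assumes the watched directory receives Hadoop SequenceFiles (SequenceFileAsBinaryInputFormat does not read arbitrary binary files), and that Spark Streaming and Hadoop are on the classpath. The object name, the helper names toBytes and byteStream, the local[2] master and the 10-second batch interval are all illustrative choices, not something from the original post.

```scala
import org.apache.hadoop.io.BytesWritable
// New-API (mapreduce) class; this is the hierarchy fileStream's type bound refers to.
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object BinarySequenceFileStream {
  // Hadoop record readers may reuse Writable instances, so copy the valid bytes out.
  def toBytes(w: BytesWritable): Array[Byte] = w.copyBytes()

  // Raw key/value bytes of every record in each new SequenceFile under `dir`.
  def byteStream(ssc: StreamingContext, dir: String): DStream[(Array[Byte], Array[Byte])] =
    ssc.fileStream[BytesWritable, BytesWritable, SequenceFileAsBinaryInputFormat](dir)
      .map { case (k, v) => (toBytes(k), toBytes(v)) }

  def main(args: Array[String]): Unit = {
    // Assumed settings: local mode with two threads, 10-second batches.
    val conf = new SparkConf().setAppName("BinarySequenceFileStream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    byteStream(ssc, "/somepath").foreachRDD { rdd =>
      println(s"records in batch: ${rdd.count()}")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The key and value classes are not free choices here: they have to match what the chosen input format produces, which for this format is BytesWritable on both sides.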

Answer 2

I finally got it to compile.

The compilation problem was in the import. I was using

import org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat

I replaced it with

import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat

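Then it worked. However, I do not understand why. I do not understand the difference between the two hierarchies. The two files seem to have the same content, so it is hard to tell. If someone could help clarify this, I think it would help a lot.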
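One way to see the difference is the bound in the compiler error, F <: org.apache.hadoop.mapreduce.InputFormat[K,V]. The two classes share a simple name, but they live in separate hierarchies, and only the mapreduce one extends the type that fileStream requires. The sketch below models this with stand-in types only; the objects mapred and mapreduce, the BoundsDemo object and its fileStream method are hypothetical and are not the real Hadoop or Spark classes.

```scala
// Stand-ins for the two Hadoop APIs. These are NOT the real classes.
object mapred {      // models org.apache.hadoop.mapred (old API, InputFormat is an interface)
  trait InputFormat[K, V]
  class SequenceFileAsBinaryInputFormat extends InputFormat[Array[Byte], Array[Byte]]
}
object mapreduce {   // models org.apache.hadoop.mapreduce (new API, InputFormat is an abstract class)
  abstract class InputFormat[K, V]
  class SequenceFileAsBinaryInputFormat extends InputFormat[Array[Byte], Array[Byte]]
}

object BoundsDemo {
  import scala.reflect.ClassTag

  // Same shape of bound as in the error message: F must extend the new-API InputFormat[K, V].
  def fileStream[K, V, F <: mapreduce.InputFormat[K, V]](directory: String)(
      implicit f: ClassTag[F]): String =
    s"$directory read with ${f.runtimeClass.getName}"

  def main(args: Array[String]): Unit = {
    // Compiles: the new-API class satisfies the bound.
    println(fileStream[Array[Byte], Array[Byte], mapreduce.SequenceFileAsBinaryInputFormat]("/somepath"))
    // Would not compile ("type arguments ... do not conform to ... bounds"),
    // because the old-API class extends a different, unrelated InputFormat:
    // fileStream[Array[Byte], Array[Byte], mapred.SequenceFileAsBinaryInputFormat]("/somepath")
  }
}
```

So the fix in the import works because it swaps in a class from the hierarchy that the bound names, even though the source of the two classes looks nearly identical.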

Reading binary files with Spark Streaming
