Question

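My understanding of Spark's fileStream() method is that it takes three types as parameters: Key, Value, and Format. For text files, the appropriate types are LongWritable, Text, and TextInputFormat.

First, I want to understand the nature of these types. Intuitively, I would guess that in this case the Key is the line number within the file and the Value is the text on that line. So, for the following example text file: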

Hello
Test
Another Test

the first row of the DStream would have a Key of 1 (0?) and a Value of Hello.

Is this correct?
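
On this first point, Hadoop's TextInputFormat documents its keys as the position of the line in the file, that is, the byte offset at which the line starts, rather than a line number; the value is the text of the line. Below is a minimal sketch in plain Scala (no Spark or Hadoop dependency) of the keys the sample file above would produce, assuming UTF-8 text and single-byte \n line endings. The object and method names are hypothetical and chosen only for illustration:

```scala
object TextInputFormatKeys {
  // Pair each line with the byte offset at which it starts.
  // Assumption: UTF-8 encoding and a one-byte "\n" terminator after every line.
  def byteOffsets(lines: Seq[String]): Seq[(Long, String)] = {
    // Bytes occupied by each line, including its terminator.
    val lengths = lines.map(_.getBytes("UTF-8").length.toLong + 1L)
    // Running sum gives the start of each line; drop the final total.
    val starts = lengths.scanLeft(0L)(_ + _).init
    starts.zip(lines)
  }

  def main(args: Array[String]): Unit =
    byteOffsets(Seq("Hello", "Test", "Another Test")).foreach {
      case (offset, line) => println(s"$offset\t$line")
    }
}
```

Under these assumptions the three lines start at offsets 0, 6 and 11, so the first record would carry Key 0 and Value Hello.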

The second part of my question: I looked at the decompiled implementation of ParquetInputFormat and noticed something curious:

public class ParquetInputFormat<T>
       extends FileInputFormat<Void, T> {
//...

public class TextInputFormat
       extends FileInputFormat<LongWritable, Text>
       implements JobConfigurable {
//...

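TextInputFormat extends FileInputFormat with the types LongWritable and Text, whereas ParquetInputFormat extends the same class with the types Void and T.

Does this mean that I must create a Value class to hold an entire row of my Parquet data, and then pass the types <Void, MyClass, ParquetInputFormat<MyClass>> to ssc.fileStream()?

If so, how should I implement MyClass?

Edit 1: I noticed that a readSupportClass is to be passed to ParquetInputFormat objects. What kind of class is this, and how is it used to parse the Parquet file? Is there documentation that covers this?

Edit 2: As far as I can tell, this is impossible. If anybody knows how to stream Parquet files into Spark, please feel free to share...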

Answer 1

My sample for reading Parquet files in Spark Streaming is below.

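import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.fs.Path
import org.apache.spark.streaming.{Seconds, StreamingContext}
import parquet.hadoop.ParquetInputFormat // org.apache.parquet.hadoop in parquet-mr 1.7+

val ssc = new StreamingContext(sparkConf, Seconds(2))
// Tell ParquetInputFormat to materialize records through Avro
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "parquet.avro.AvroReadSupport")
val stream = ssc.fileStream[Void, GenericRecord, ParquetInputFormat[GenericRecord]](
  directory, { path: Path => path.toString.endsWith("parquet") }, true, ssc.sparkContext.hadoopConfiguration)

// Each element is a (Void, GenericRecord) pair; the key is always null
val lines = stream.map(row => {
  println("row:" + row.toString())
  row
})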

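Some points:

The record type is GenericRecord
The readSupportClass is AvroReadSupport
Pass the Configuration to fileStream
Set parquet.read.support.class in the Configuration

I referred to the source code below when creating this sample. I could not find any good examples either, and I am still hoping for a better one.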

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala

Answer 2

Try this:

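import org.apache.parquet.hadoop.ParquetInputFormat
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(5))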
val schema = StructType(Seq(
  StructField("a", StringType, nullable = false),
  ........
))
val schemaJson = schema.json

val fileDir="/tmp/fileDir"
// Use Spark SQL's own Parquet read support and pass it the requested schema as JSON
ssc.sparkContext.hadoopConfiguration.set("parquet.read.support.class", "org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport")
ssc.sparkContext.hadoopConfiguration.set("org.apache.spark.sql.parquet.row.requested_schema", schemaJson)
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_BINARY_AS_STRING.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key, "false")
ssc.sparkContext.hadoopConfiguration.set(SQLConf.PARQUET_WRITE_LEGACY_FORMAT.key, "false")

val streamRdd = ssc.fileStream[Void, UnsafeRow, ParquetInputFormat[UnsafeRow]](fileDir,(t: org.apache.hadoop.fs.Path) => true, false)

streamRdd.count().print()

ssc.start()
ssc.awaitTermination()

By the way, I am using Spark 2.1.0.

How do I read Parquet files using `ssc.fileStream()`? What are the types passed to `ssc.fileStream()`?
