Question

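I'm learning Spark Streaming and have what is probably a simple problem. I want to ingest whole text files from a directory. The method usually recommended for this is wholeTextFiles, as opposed to textFile, which splits files by line. However, as far as I can tell, that method is not available in a streaming context.

Is there a simple way to get a similar effect, i.e. to receive (file name, whole file content) pairs while streaming?

A Scala example that uses StreamingContext and SparkSession would be great.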

Answer 1

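I also searched for a wholeTextFiles equivalent in the streaming context, but could not find anything in the official API.

However, I did come across the private WholeTextFileInputFormat class, which can be used with fileStream to stream (file path, file content) tuples. Because the class is private, it cannot be used directly. My workaround may be a bit clumsy: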

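1. Copy the files WholeTextFileInputFormat.scala and WholeTextFileRecordReader.scala from the Apache Spark repository into your project.
2. Adjust the package namespace accordingly (and the access modifiers, if necessary).
3. Create the stream with fileStream, using WholeTextFileInputFormat as the input format.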

Here is an example in Scala, assuming ssc is your StreamingContext.

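import org.apache.hadoop.io.Text
// WholeTextFileInputFormat must also be imported, from the package you copied it into.

val directory = "/the/directory/to/watch"
// Each record is a (file path, file content) pair.
val stream = ssc.fileStream[Text, Text, WholeTextFileInputFormat](directory)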
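
One practical follow-up: Text is a Hadoop Writable and does not implement java.io.Serializable, so it is usually worth converting the pairs to plain strings before doing anything that shuffles or collects them. A minimal sketch, assuming the stream has the DStream[(Text, Text)] type produced by fileStream; the names WholeFileDStreamSketch and toStrings are made up for illustration and are not part of Spark:

```scala
import org.apache.hadoop.io.Text
import org.apache.spark.streaming.dstream.DStream

object WholeFileDStreamSketch {
  // Hypothetical helper (not a Spark API): turns (Text, Text) records into
  // (file path, file content) pairs of plain, serializable Strings.
  def toStrings(stream: DStream[(Text, Text)]): DStream[(String, String)] =
    stream.map { case (path, content) => (path.toString, content.toString) }
}
```

With that in place, something like WholeFileDStreamSketch.toStrings(stream).print() should show one (path, content) pair per newly arrived file.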

Answer 2

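Well, the OP has probably not had this problem since 2017, but I was looking for exactly this and had given up when I came across a solution: Spark 3 introduces a format that can be used to achieve this.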

https://spark.apache.org/docs/3.0.0-preview/sql-data-sources-binaryFile.html

My implementation looks something like this:

import org.apache.spark.sql.types.{StructType, StructField, StringType, TimestampType, LongType, BinaryType}

// This schema is fixed. I don't know if there is a ready-made object for it, didn't look into it tbh
val schema = StructType(List(
  StructField("path", StringType, false),
  StructField("modificationTime", TimestampType, false),
  StructField("length", LongType, false),
  StructField("content", BinaryType, true)
))

val myDf = spark.readStream
  .format(...)
  .option("fileFormat", "binaryFile")
  .schema(schema)
  .load()

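This approach worked for me. The content column holds the actual bytes of the file, and from there you can simply convert it into whatever final object you need.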
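
For plain Spark 3, the binaryFile source from the linked documentation can also be named directly in format. The sketch below is one way to get (path, text) rows from it, not the exact code from this answer. It assumes the files are UTF-8 text, and the names WholeFileStreamSketch, binaryFileSchema and wholeTextFileStream are hypothetical. The explicit schema is kept because streaming file sources generally require one unless schema inference is enabled.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, decode}
import org.apache.spark.sql.types.{BinaryType, LongType, StringType, StructField, StructType, TimestampType}

object WholeFileStreamSketch {
  // Fixed schema of the binaryFile source, as listed in the answer.
  val binaryFileSchema: StructType = StructType(List(
    StructField("path", StringType, false),
    StructField("modificationTime", TimestampType, false),
    StructField("length", LongType, false),
    StructField("content", BinaryType, true)
  ))

  // Hypothetical helper (not a Spark API): one row per new file in `dir`,
  // with the raw bytes decoded as UTF-8 text (the encoding is an assumption).
  def wholeTextFileStream(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .format("binaryFile")
      .schema(binaryFileSchema)
      .load(dir)
      .select(col("path"), decode(col("content"), "UTF-8").as("text"))
}
```

To try it out, something like WholeFileStreamSketch.wholeTextFileStream(spark, "/the/directory/to/watch").writeStream.format("console").start() should print each new file as a single row.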

Spark Streaming: ingesting whole text files
