Question

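My sequence files have keys that are either LongWritable or Text. The values all have the same format (JSON). I want to process all of them in a single Spark job, but I can't figure out how to write the code so that it works for both Text and LongWritable keys. I don't actually care about the sequence record keys in my job; I never use them.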

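Here is what I do for LongWritable keys. How can I extend it to work for both LongWritable and Text keys? Is there a way to load only the sequence file record values and ignore the keys?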

val rdd = sparkCtx.sequenceFile[Long, String](srcDir)

// put into Json records, don't care about seq key
val jsonRecs = rdd.map((record: (Long, String)) => new String(record._2))

Answer 1

Here is my NullWritable solution, which works for sequence files with either Text or LongWritable keys.

I read from a local text file during local testing, and from HDFS when running on the cluster.

     import org.apache.hadoop.io.NullWritable

     var rdd = if (inputFileType.equalsIgnoreCase(InputFileType_Text)) {
        // Read local text file
        // Tried using a NullWritable here for local testing, but it throws
        // a 'Not Serializable' error.  Using null instead.
        sparkCtx.textFile(srcDir).map(line => {
           val tokens = line.split("\t")
           (null, tokens(1))
        })
     } else  {
        // Default to assuming sequence files are input
        // Read HDFS directory of seq files.
        log.debug("SEQUENCE files, srcDir={}", srcDir)
        sparkCtx.sequenceFile[NullWritable, String](srcDir)
     }
     log.debug("LOADED: rdd<NullWritable,String>")

     // Json records
     val jsonRecs = rdd.map((record: (NullWritable, String)) => new String(record._2))
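
A possible alternative, shown only as a minimal sketch: since the keys are never used, the key class can be declared as the generic Hadoop Writable interface and the values read as Text, then converted to String. The object name ReadSeqValues and the method name jsonValues are hypothetical, chosen for illustration; the sketch assumes the values are stored as Text and that Spark and the Hadoop client libraries are on the classpath.

```scala
import org.apache.hadoop.io.{Text, Writable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original answer.
object ReadSeqValues {
  // Returns only the record values as Strings.
  // The key is typed as the generic Writable interface and is never inspected,
  // so the concrete key class stored in each file does not matter here.
  def jsonValues(sc: SparkContext, srcDir: String): RDD[String] =
    sc.sequenceFile(srcDir, classOf[Writable], classOf[Text])
      // Hadoop reuses the same Text instance across records,
      // so copy the contents into an immutable String right away.
      .map { case (_, value) => value.toString }
}
```

It would be called as ReadSeqValues.jsonValues(sparkCtx, srcDir), and the result can be used in place of jsonRecs above. The immediate toString matters because the record reader reuses the Writable objects between records.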

Spark Scala - Reading sequence files with multiple key types?
