Question

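I have a dataset of employees and their leave records. Each record (of type EmployeeRecord) contains an EmpID (of type String) and other fields. I read the records from a file and then convert them to PairRDDFunctions: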

val empRecords = sc.textFile(args(0))
....

val empsGroupedByEmpID = this.groupRecordsByEmpID(empRecords)

At this point, 'empsGroupedByEmpID' is of type RDD[(String, Iterable[EmployeeRecord])]. I convert it to PairRDDFunctions:

val empsAsPairRDD = new PairRDDFunctions[String,Iterable[EmployeeRecord]](empsGroupedByEmpID)

Then I process the records according to the application's logic. In the end, I get an RDD of type RDD[Iterable[EmployeeRecord]]:

val finalRecords: RDD[Iterable[EmployeeRecord]] = <result of a few computations and transformation>

When I try to write the contents of this RDD to a text file using the available API:

finalRecords.saveAsTextFile("./path/to/save")

I find that every record in the file begins with ArrayBuffer(...). What I need is a file with one EmployeeRecord per line. Is that not possible? Am I missing something?

Answer 1

I found the missing API. It is, well... flatMap! :-)

By using flatMap with identity, I can get rid of the Iterable wrapper and 'unpack' its contents, like so:

finalRecords.flatMap(identity).saveAsTextFile("./path/to/file")

This solved the problem I had been running into.
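
For anyone wondering why this works: saveAsTextFile writes the string form of each element, and here each element is a whole Iterable, hence the ArrayBuffer(...) lines. Below is a minimal sketch of the same effect using plain Scala collections in place of an RDD, so it runs without Spark. The EmployeeRecord fields, the sample values, and the FlattenSketch object are hypothetical and only there for illustration; the real record type is not shown in the question.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the question's EmployeeRecord (real fields unknown).
final case class EmployeeRecord(empID: String, leaveDate: String)

object FlattenSketch {
  // Local collection standing in for RDD[Iterable[EmployeeRecord]].
  val finalRecords: Seq[Iterable[EmployeeRecord]] = Seq(
    ArrayBuffer(EmployeeRecord("E1", "2014-01-02"), EmployeeRecord("E1", "2014-03-04")),
    ArrayBuffer(EmployeeRecord("E2", "2014-05-06"))
  )

  // Without flattening: one line per group, each printed as ArrayBuffer(...).
  val groupedLines: Seq[String] = finalRecords.map(_.toString)

  // With flatMap(identity): each Iterable is unwrapped, one line per record.
  val flatLines: Seq[String] = finalRecords.flatMap(identity).map(_.toString)

  def main(args: Array[String]): Unit = {
    println("Grouped:")
    groupedLines.foreach(println)
    println("Flattened:")
    flatLines.foreach(println)
  }
}
```

If the default case-class string form is not the line format you want, a map to a custom string before saving (for example joining the fields with commas) should do it.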

I also found this post, which suggests the same thing. I wish I had seen it a bit earlier.