Spark Scala - Dataset[Array[String]] of records - write each record starting on a new line

Asked: 2018-03-28 09:35:23

Tags: scala apache-spark

I have a Dataset[Array[String]] containing strings like:

12345, 2341, a465c2a, p, 2015-06-10, 2015-02-23, 2015-02-23, 2, "", 1, 98941, 1, ., 17, 21, 1, "", 67890, 4313, a465c2a, p, 2015-06-10, 2015-02-23, 2015-02-23, 2, 7391, 1, 98941, 1, ., 17, 21, 1, 01

Counting from zero in this string, a record ends at position 16, and index 17 is the start of the next record. How can I save this as a text file in Spark so that each new record starts on a new line? I know a Dataset can be saved as a text file, e.g. with write.text.

1 answer:

Answer 0 (score: 0)

One way to do this is to use the `sliding` function on the `Array[String]` and append a `"\n"` to the end of each record's `String`, since you already know the end index of each record:

```scala
// your original Dataset
val data: Dataset[Array[String]] = sqlContext.createDataset(Seq(Array(
  "12345", "2341", "a465c2a", "p", "2015-06-10", "2015-02-23", "2015-02-23",
  "2", " ", "1", "98941", "1", ".", "17", "21", "1",
  "67890", "4313", "a465c2a", "p", "2015-06-10", "2015-02-23", "2015-02-23",
  "2", "7391", "1", "98941", "1", ".", "17", "21", "1", "01")))

// apply the sliding function to the Array and append \n after each group of 17 fields
val result: RDD[String] = data.rdd.map(_.sliding(17, 17).map(_.mkString(",") + "\n").mkString(""))

// to display the output
result.foreach(print(_))
// output:
// 12345,2341,a465c2a,p,2015-06-10,2015-02-23,2015-02-23,2, ,1,98941,1,.,17,21,1,67890
// 4313,a465c2a,p,2015-06-10,2015-02-23,2015-02-23,2,7391,1,98941,1,.,17,21,1,01

// to save the result to file
result.saveAsTextFile("PATH_TO_SAVE_FILE")
```
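Since this approach hinges on `sliding(n, n)` producing non-overlapping groups of `n` fields, the grouping logic can be sanity-checked without Spark on a plain Scala `Array` (the values below are illustrative placeholders, not data from the question):

```scala
// Minimal plain-Scala check of the sliding(n, n) grouping technique used above.
// No Spark required: Array inherits sliding from the Scala collections library.
val fields: Array[String] = Array("1", "2", "3", "4", "5", "6")

// sliding(3, 3) steps by the window size, so the groups do not overlap;
// each group of 3 fields is joined into one comma-separated line.
val lines: String = fields.sliding(3, 3).map(_.mkString(",")).mkString("\n")

println(lines)
// 1,2,3
// 4,5,6
```

With the question's data the window size would be 17 instead of 3, one window per record.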
