Basically I need to create output files based on the DataPartition
column, which is the last column in the dataframe.
So the first and last rows should be saved to Fundamental.Fundamental.Fundamental.Japan.1.2018-09-24-0937.Full.txt,
and the middle row should be saved to Fundamental.Fundamental.Fundamental.ThirdParty.1.2018-09-24-0937.Full.txt.
+--------------------------------+--------------+---------------------------+-------------------------+--------+------------------+--------+-----------------+---------------+--------------------------+---------------------------+---------------+-------------------------+-----------------------------+-----------------------------------+-----------------------+----------------------------+----------------------------------+--------------------+----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------+
|Fundamental_uniqueFundamentalSet|OrganizationId|OrganizationId_objectTypeId|OrganizationId_objectType|GaapCode|ConsolidationBasis|IsFiling|NonFilingDateTime|NonFilingReason|PrimaryReportingEntityCode|TotalPrimaryReportingShares|LocalLanguageId|Fundamental_effectiveFrom|Fundamental_effectiveFromPlus|Fundamental_effectiveFromPlusNaCode|Fundamental_effectiveTo|Fundamental_effectiveToMinus|Fundamental_effectiveToMinusNACode|ConsolidationBasisId|GaapCodeId|FFAction|!||DataPartition |
+--------------------------------+--------------+---------------------------+-------------------------+--------+------------------+--------+-----------------+---------------+--------------------------+---------------------------+---------------+-------------------------+-----------------------------+-----------------------------------+-----------------------+----------------------------+----------------------------------+--------------------+----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------+
|192730241374 |4295877894 |404010 |Organization |JPG |Consolidated |true | | |A51EF |117588807.00000 |505126 |2013-06-29T00:55:15Z | | |9999-12-31T00:00:00Z | | |3013598 |3011577 |I|!| |file:/C:/Users/u6034690/Desktop/SPARK/trfsmallfffile/Fundamental/FINALSPARK/Fundamental.Fundamental.Fundamental.Japan.1.2018-09-24-0937.Full.txt|
|192730391384 |4295877894 |404010 |Organization |AOG |Consolidated |true | | |A51EF |117588807.00000 |505126 |2018-09-19T09:51:46Z | | |9999-12-31T00:00:00Z | | |3013598 |1003042842|I|!| |file:/C:/Users/u6034690/Desktop/SPARK/trfsmallfffile/Fundamental/FINALSPARK/Fundamental.Fundamental.Fundamental.ThirdParty.1.2018-09-24-0937.Full.txt|
|192730241373 |4295877894 |404010 |Organization |JPG |Parent |true | | |A51EF |117588807.00000 |505126 |2013-06-29T00:55:15Z | | |9999-12-31T00:00:00Z | | |3013599 |3011577 |I|!| |file:/C:/Users/u6034690/Desktop/SPARK/trfsmallfffile/Fundamental/FINALSPARK/Fundamental.Fundamental.Fundamental.Japan.1.2018-09-24-0937.Full.txt|
+--------------------------------+--------------+---------------------------+-------------------------+--------+------------------+--------+-----------------+---------------+--------------------------+---------------------------+---------------+-------------------------+-----------------------------+-----------------------------------+-----------------------+----------------------------+----------------------------------+--------------------+----------+-----------+------------------------------------------------------------------------------------------------------------------------------------------------+
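For reference, the distinct DataPartition values (which should map one-to-one onto the output files) can be listed with a small sketch like this, assuming the data shown above is bound to a variable named dataframe:
// Show the distinct DataPartition values without truncating the long file paths.
dataframe.select("DataPartition").distinct().show(false)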
Something like the following is what I have been trying, but it does not work.
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.HashPartitioner
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Output format that drops the key from the written record and uses it as the output file name instead.
class RddMultiTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = key.asInstanceOf[String]
}

// This is the part that fails: partitionBy(Partitioner) and saveAsHadoopFile are RDD methods, not DataFrame methods.
dataframe.partitionBy(new HashPartitioner(noOfHashPartitioner)).saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat], classOf[GzipCodec])
Expected output:
Fundamental.uniqueFundamentalSet|^|OrganizationId|^|OrganizationId.objectTypeId|^|OrganizationId.objectType|^|GaapCode|^|ConsolidationBasis|^|IsFiling|^|NonFilingDateTime|^|NonFilingReason|^|PrimaryReportingEntityCode|^|TotalPrimaryReportingShares|^|LocalLanguageId|^|Fundamental.effectiveFrom|^|Fundamental.effectiveFromPlus|^|Fundamental.effectiveFromPlusNaCode|^|Fundamental.effectiveTo|^|Fundamental.effectiveToMinus|^|Fundamental.effectiveToMinusNACode|^|ConsolidationBasisId|^|GaapCodeId|^|FFAction|!|
192730241373|^|4295877894|^|404010|^|Organization|^|JPG|^|Parent|^|True|^||^||^|A51EF|^|117588807.00000|^|505126|^|2013-06-29T00:55:15Z|^||^||^|9999-12-31T00:00:00Z|^||^||^|3013599|^|3011577|^|I|!|
192730241374|^|4295877894|^|404010|^|Organization|^|JPG|^|Consolidated|^|True|^||^||^|A51EF|^|117588807.00000|^|505126|^|2013-06-29T00:55:15Z|^||^||^|9999-12-31T00:00:00Z|^||^||^|3013598|^|3011577|^|I|!|
Answer 0 (score: 1)
You need to create a pair RDD whose key is the output file name and whose value is the record; then you can call saveAsHadoopFile()
to save the files the way you want.
import org.json.JSONObject

val dataframe = .... //this is the dataframe that you want to save

// Convert each row to a JSON string and key it by its DataPartition value,
// so the key can be used as the output file name by RddMultiTextOutputFormat.
val pairedRDD = dataframe.toJSON.rdd.map(row => {
  val record = new JSONObject(row)
  val key = record.getString("DataPartition")
  (key, row)
})

pairedRDD.partitionBy(new HashPartitioner(noOfHashPartitioner))
  .saveAsHadoopFile(output, classOf[String], classOf[String], classOf[RddMultiTextOutputFormat])  // output is the target directory, as in the question
This will give you the output you want.
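Note that the snippet above writes each record as a JSON string, while the expected output in the question is "|^|"-delimited text. Below is a minimal sketch of one way to produce the delimited lines instead, assuming the same dataframe, output, noOfHashPartitioner and RddMultiTextOutputFormat from the question; the delimiter and dataColumns names are only illustrative:
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.HashPartitioner

val delimiter = "|^|"                                            // field separator used in the expected output
val dataColumns = dataframe.columns.filter(_ != "DataPartition") // keep every column except the partitioning one

// Key each row by its DataPartition value (the target file name) and join the
// remaining columns into a single delimited line, turning nulls into empty strings.
val keyedLines = dataframe.rdd.map { row =>
  val key = row.getAs[String]("DataPartition")
  val line = dataColumns
    .map(c => Option(row.getAs[Any](c)).map(_.toString).getOrElse(""))
    .mkString(delimiter)
  (key, line)
}

keyedLines.partitionBy(new HashPartitioner(noOfHashPartitioner))
  .saveAsHadoopFile(output, classOf[String], classOf[String],
    classOf[RddMultiTextOutputFormat], classOf[GzipCodec])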