如何使用Spark中的拆分列和记录中可用的分隔符

时间:2017-12-20 15:29:55

标签: scala apache-spark apache-spark-sql

我有以下数据集

Segment.organizationId|^|Segment.segmentId|^|SegmentType|^|SegmentName|^|SegmentName.languageId|^|SegmentLocalLanguageLabel|^|SegmentLocalLanguageLabel.languageId|^|ValidFromPeriodEndDate|^|ValidToPeriodEndDate|^|SegmentInactivationReasonCode|^|SegmentOrganizationId|^|IsShariaCompliant|^|IsCorporate|^|IsElimination|^|IsOther|^|InactiveReasonOtherDescription|^|InactiveReasonOtherDescription.languageId|^|IsOperatingSegment|^|SegmentFundbDescription|^|SegmentFundbDescription.languageId|^|SegmentTypeId|^|SegmentInactiveReasonId|^|FFAction|!|
4295876080|^|7|^|B|^|Test ||^|505074|^|jtrsu|^|505126|^|2010-03-31T00:00:00Z|^||^||^||^|False|^|False|^|False|^|False|^||^|505074|^|False|^||^|505074|^|3013618|^||^|I|!|

这是我的代码

val df = sqlContext.read.format("csv").option("header", "true").option("delimiter", "|").option("inferSchema","true").load("s3://trfsmallfffile/FinancialSegment/TEST")

但这并没有给我正确的输出

这是我的输出

+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|DataPartition|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+
|            4295876080|                7|          B|      Test |                     ^|                        ^|                                   ^|                     ^|                   ^|                            ^|                    ^|                ^|          ^|            ^|      ^|                             ^|                                        ^|                 ^|                      ^|                                 ^|            ^|                      ^|       ^|        Japan|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+--------+-------------+

我得到了这个,因为记录中使用了|字符。

我该如何处理这种情况?

我的预期输出低于

...+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|Segment.organizationId|Segment.segmentId|SegmentType|SegmentName|SegmentName.languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel.languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription.languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription.languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|4295876080            |7                |B          |Test |     |505074                |jtrsu                    |505126                              |2010-03-31T00:00:00Z  |                    |                             |                     |False            |False      |False        |False  |                              |505074                                   |False             |                       |505074                            |3013618      |                       |I       |
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+

1 个答案:

答案 0 :(得分:3)

option参数中的 spark sql 不支持分隔符的多个字符。所以我建议您使用sparkContext,因为split函数支持多个字符。

因此,您的第一步是使用sparkContext

阅读文件
val rdd = sc.textFile("s3://trfsmallfffile/FinancialSegment/TEST")

然后你需要分开标题的第一行并从中创建schema

val header = rdd.filter(_.contains("Segment.organizationId")).map(line => line.split("\\|\\^\\|")).first()
val schema = StructType(header.map(cols => StructField(cols.replace(".", "_"), StringType)).toSeq)

最后一步是使用dataframe创建的

创建所需的schema
val data = sqlContext.createDataFrame(rdd.filter(!_.contains("Segment.organizationId")).map(line => Row.fromSeq(line.split("\\|\\^\\|").toSeq)), schema).show(false)

您应该关注dataframe

+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|Segment_organizationId|Segment_segmentId|SegmentType|SegmentName|SegmentName_languageId|SegmentLocalLanguageLabel|SegmentLocalLanguageLabel_languageId|ValidFromPeriodEndDate|ValidToPeriodEndDate|SegmentInactivationReasonCode|SegmentOrganizationId|IsShariaCompliant|IsCorporate|IsElimination|IsOther|InactiveReasonOtherDescription|InactiveReasonOtherDescription_languageId|IsOperatingSegment|SegmentFundbDescription|SegmentFundbDescription_languageId|SegmentTypeId|SegmentInactiveReasonId|FFAction|!||
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+
|4295876080            |7                |B          |Test |     |505074                |jtrsu                    |505126                              |2010-03-31T00:00:00Z  |                    |                             |                     |False            |False      |False        |False  |                              |505074                                   |False             |                       |505074                            |3013618      |                       |I|!|       |
+----------------------+-----------------+-----------+-----------+----------------------+-------------------------+------------------------------------+----------------------+--------------------+-----------------------------+---------------------+-----------------+-----------+-------------+-------+------------------------------+-----------------------------------------+------------------+-----------------------+----------------------------------+-------------+-----------------------+-----------+