I need to read an ASCII file like the one below. Each item on a line is separated by some delimiter (e.g. a comma or a colon).
ROUND 079675882 1446320661365001Y00000 000M 8019721193 ROUND 6613-144632 000875 <EOR>
ROUND 079675882 1446320661365001Y10000 S10 ROUND 875079675882 144632 11180524180525XYZSONS1 21130 8019721193 ROUND 1805241300000000000087500000000180524144632 XYZSONS COMPANIES, LLC 9 0091372096500NATIONAL SERVICES CENTER P.O. BOX 29093 AZAD AZ85038 BUGASON A SUB. OF ALBERTSONS, LLC 9 0091372096613 <EOR>
ROUND 079675882 1446320661365001Y20000 S11 Boundaris GHBC 3649 F Public Court Cian ID83642 HELTHY HEALTHCARE LLC 9 079675882 1190 OMEGA DR. MANGO PA15205 0100BDDARYL BHINDI 2088874065 TENOT USED 02180605GEN TRUCK 0258220026501 <EOR>
ROUND 079675882 1446320661365001Y30000 S12 0000034CA00000178LB00000000000000000000000000181450000000000000NPO BOX 826614 - ABS AP UGANDA, PA PPM 018889974498GEN GEN GENZZ1 GENZZ2 GEN GEN GENZZ3 GENZZ4 GENZZ5 <EOR>
I used the following code, but it does not work:
val DataReaderDF = spark.read
  // I am not sure whether this delimiter is appropriate
  // for my ASCII input file
  .option("delimiter", "\r\n\r\n")
  .option("header", false)
  .text("/example_data/InputFile/20180524_840860__PO_D20180524130814_TXT")
How can I load this kind of dataset?
Answer 0 (score: 0)
One way to handle such a file is to load it as a text file (using the text data source, which reads whole lines and therefore ignores the delimiter option) and then split each line on whitespace, or on whatever delimiter the file actually uses.
val entireFileAsSingleColumn = spark.read.text("tab-separated.txt")
scala> entireFileAsSingleColumn.printSchema
root
|-- value: string (nullable = true)
val splitLines = entireFileAsSingleColumn.withColumn("split", split('value, "\\s+"))
scala> splitLines.printSchema
root
|-- value: string (nullable = true)
|-- split: array (nullable = true)
| |-- element: string (containsNull = true)
// select as many $"split"(i) elements as a line has fields
val solution = splitLines.select($"split"(0) as "round", $"split"(1) as "num")
scala> solution.show
+-----+---------+
|round| num|
+-----+---------+
|ROUND|079675882|
|ROUND|079675882|
|ROUND|079675882|
|ROUND|079675882|
| | null|
+-----+---------+
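The empty last row comes from a blank line in the input file. As a follow-up, here is a minimal sketch, assuming the same DataFrame as above and that every record ends with the <EOR> marker shown in the sample, that drops blank lines and strips the marker before splitting (the regular expressions are my own additions):

// spark-shell pre-imports spark.implicits._ and org.apache.spark.sql.functions._;
// in a standalone application add both imports explicitly
val cleaned = entireFileAsSingleColumn
  .where(trim($"value") =!= "")                                        // drop blank lines
  .withColumn("value", regexp_replace($"value", "\\s*<EOR>\\s*$", "")) // strip the record terminator
  .withColumn("split", split($"value", "\\s+"))                        // tokenize on whitespace
val solution = cleaned.select($"split"(0) as "round", $"split"(1) as "num")

With the blank lines filtered out, the trailing null row no longer appears in solution.show.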