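I am working on a Spark case study. I have CSV files in HDFS and I am processing the data with Spark. One of the columns contains merged data.
For example, the title column contains values such as:
"EMS: BACK PAINS/INJURY". EMS is the emergency category, and the text after the colon (:) is the emergency type. When loading the CSV into a DataFrame, I need to load only the part before the colon (EMS in this example). Here is my code snippet, but it loads the full title column. Can you help me fix it?
Code: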
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("latitude", DoubleType, true),
  StructField("longitude", DoubleType, true),
  StructField("desc", StringType, true),
  StructField("zip", StringType, true),
  StructField("title", StringType, true),
  StructField("timeStamp", StringType, true),
  StructField("twp", StringType, true),
  StructField("addr", StringType, true),
  StructField("e", IntegerType, true)
))
val df = spark.read.option("header", "true").schema(schema).csv("hdfs://filepath/filename.csv")
Sample data:
lat|lng|desc|zip|title|timeStamp|twp|addr|e
40.2978759|-75.5812935|REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;|19525|EMS: BACK PAINS/INJURY|12/10/2015 17:40|NEW HANOVER|REINDEER CT & DEAD END|1
40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS: DIABETIC EMERGENCY|12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1
40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;|19401|Fire: GAS-ODOR/LEAK|12/10/2015 17:40|NORRISTOWN|HAWS AVE|1
Answer 0 (score: 0)
Read the CSV with | as the delimiter, then keep only the part of title before the colon:
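import org.apache.spark.sql.functions.split
import spark.implicits._ // enables the $"column" syntax

val data = spark.read
  .option("delimiter", "|")
  .option("header", true)
  .schema(schema)
  .csv(path) // path: location of the CSV file, e.g. the HDFS path from the question
  // split the title column on ":" and keep only the part before it
  .withColumn("title", split($"title", ":")(0))

data.show(false)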
Output:
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|latitude |longitude |desc |zip |title|timeStamp |twp |addr |e |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|40.2978759|-75.5812935|REINDEER CT & DEAD END; NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52; |19525|EMS |12/10/2015 17:40|NEW HANOVER |REINDEER CT & DEAD END |1 |
|40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS |12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1 |
|40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27; |19401|Fire |12/10/2015 17:40|NORRISTOWN |HAWS AVE |1 |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
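To see what the split expression does to each title value, here is a minimal sketch in plain Scala that needs no Spark session. It uses String.split, which like Spark's split treats the delimiter as a regular expression. The object name TitleCategory and the helper categoryOf are hypothetical names introduced only for this illustration; they are not part of Spark.

```scala
object TitleCategory {
  // Hypothetical helper: returns the text before the first ":".
  // If the title has no colon, the whole string is returned unchanged.
  def categoryOf(title: String): String =
    title.split(":")(0)

  def main(args: Array[String]): Unit = {
    // Title values taken from the sample data in the question.
    val titles = Seq(
      "EMS: BACK PAINS/INJURY",
      "EMS: DIABETIC EMERGENCY",
      "Fire: GAS-ODOR/LEAK"
    )
    // Prints EMS, EMS and Fire, one per line.
    titles.map(categoryOf).foreach(println)
  }
}
```

This only mirrors the per-row logic. In the DataFrame version the same operation is applied to the whole column by withColumn.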
Hope this helps!