Using Apache Spark

Time: 2018-04-09 17:24:02

Tags: scala apache-spark apache-spark-sql spark-dataframe

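I am working on a Spark case study. I have a CSV file in HDFS and I am processing the data with Spark. One of the columns holds two pieces of information merged into a single value.

For example, the title column contains values such as:

"EMS: BACK PAINS/INJURY". EMS stands for emergency, and the text after the colon is the type of emergency. When loading the CSV into a DataFrame, I need to load only the part before the colon (EMS in this case). Here is my code snippet, but it loads the complete title column. Can you help me fix it?

Code: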

    import org.apache.spark.sql.types._

    val schema = StructType(Array(
      StructField("latitude", DoubleType, true),
      StructField("longitude", DoubleType, true),
      StructField("desc", StringType, true),
      StructField("zip", StringType, true),
      StructField("title", StringType, true),
      StructField("timeStamp", StringType, true),
      StructField("twp", StringType, true),
      StructField("addr", StringType, true),
      StructField("e", IntegerType, true)))

    val df = spark.read.option("header", "true").schema(schema).csv("hdfs://filepath/filename.csv")

Sample data:

    lat|lng|desc|zip|title|timeStamp|twp|addr|e
    40.2978759|-75.5812935|REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;|19525|EMS: BACK PAINS/INJURY|12/10/2015 17:40|NEW HANOVER|REINDEER CT & DEAD END|1
    40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS: DIABETIC EMERGENCY|12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1
    40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;|19401|Fire: GAS-ODOR/LEAK|12/10/2015 17:40|NORRISTOWN|HAWS AVE|1

1 Answer:

Answer 0 (score: 0)

Load the CSV with | as the delimiter, then keep only the part of the title column before the colon:
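    import org.apache.spark.sql.functions.split
    import spark.implicits._  // enables the $"..." column syntax

    val data = spark.read
      .option("delimiter", "|")
      .option("header", true)
      .schema(schema)
      .csv(path)
      // split the title column on ":" and keep only the part before it
      .withColumn("title", split($"title", ":")(0))

    data.show(false)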

Output:

+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|latitude  |longitude  |desc                                                                               |zip  |title|timeStamp       |twp              |addr                      |e  |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
|40.2978759|-75.5812935|REINDEER CT & DEAD END;  NEW HANOVER; Station 332; 2015-12-10 @ 17:10:52;          |19525|EMS  |12/10/2015 17:40|NEW HANOVER      |REINDEER CT & DEAD END    |1  |
|40.2580614|-75.2646799|BRIAR PATH & WHITEMARSH LN;  HATFIELD TOWNSHIP; Station 345; 2015-12-10 @ 17:29:21;|19446|EMS  |12/10/2015 17:40|HATFIELD TOWNSHIP|BRIAR PATH & WHITEMARSH LN|1  |
|40.1211818|-75.3519752|HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-Station:STA27;                         |19401|Fire |12/10/2015 17:40|NORRISTOWN       |HAWS AVE                  |1  |
+----------+-----------+-----------------------------------------------------------------------------------+-----+-----+----------------+-----------------+--------------------------+---+
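
As a variation on the split approach, here is a minimal sketch that keeps both halves of the title in separate columns using substring_index. It is written as a standalone script under a few assumptions: the local SparkSession, the column names category and type, and the two in-memory rows (copied from the sample data in the question) are for illustration only and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index, trim}

// Local session for illustration only (assumption); in spark-shell a session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("title-split-sketch")
  .getOrCreate()
import spark.implicits._

// Two titles copied from the sample data in the question.
val titles = Seq("EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK").toDF("title")

val result = titles
  // Text before the first ":" (hypothetical column name "category").
  .withColumn("category", substring_index(col("title"), ":", 1))
  // Text after the last ":", with surrounding spaces removed (hypothetical column name "type").
  .withColumn("type", trim(substring_index(col("title"), ":", -1)))

// Bring the two derived columns back to the driver as (category, type) pairs.
val rows = result.select("category", "type").as[(String, String)].collect()
rows.foreach(println)

spark.stop()
```

Note that substring_index with a count of -1 returns the text after the last colon, so a title containing more than one colon would need different handling. Taking index 0 of the split result, as in the code above the output, always returns the text before the first colon.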

Hope this helps!