Spark Scala - DataFrames & CSV - extracting part of a value

Posted: 2017-11-09 10:30:46

Tags: scala csv apache-spark

I have a CSV file named sampleOrder.csv that looks like this:

CarrierName,CarrierCustomerNumber,CarrierReference,CustomerReference,TransportDate,postcode,ProductDescription,ServiceDescription
DPD UK,260432,1.5503E+13,JO01974834,1/14/2013,LU7 4QT,PARCEL,NXTDAY
DPD UK,260364,1.55011E+13,C015800315,12/31/2012,BS3  5DH,PARCEL,NXTDAY
DPD UK,260268, 15501675752897R,953902,1/15/2013,CV10 7RL,REVERSE IT,NXTDAY
DPD UK,260162,1.55017E+13,C015889556,1/14/2013,IP13 6ET,PARCEL,  NXTDAY
DPD UK,260364,1.55011E+13,C015939958,1/21/2013,SW6 7JY,PARCEL,  NXTDAY
DPD UK,260363,1.55012E+13,C015854701,1/10/2013,RG41 2AN,PARCEL,  NXTDAY
DPD UK,260364,1.55011E+13,C015945032,1/22/2013,RG5  4JB,PARCEL,  NXTDAY
DPD UK,260268,1.55017E+13,967819,1/11/2013, HD1 2QE,PARCEL,  NXTDAY
DPD UK,260364,1.55011E+13,C015966537,1/24/2013,ST1  6SL,HOME DELIVERY,AFNOON
DPD UK,260364, 15500557912288R,C015821652,1/4/2013,CV10 7RL,SWAPIT,NXTDAY

I created a Spark SQL context, and I load the CSV file into a DataFrame like this:

val OrdersRAW = spark.read
                      .format("csv")
                      .option("header", "true")
                      .option("mode", "DROPMALFORMED")
                      .csv("Order_201301.csv")

I now want to load all the columns from the file, and additionally extract just the first part of the postcode and populate it into another column. This is the part I am struggling with:

val ordersNew = OrdersRAW.select("CarrierName", "CarrierCustomerNumber", "CarrierReference",
                                 "CustomerReference", "TransportDate",
                                 "postcode".substring(0, 4).trim(),
                                 "ProductDescription", "ServiceDescription")

Any ideas on how to achieve this? Thanks in advance for your help. I am using Spark 2.0+.

2 Answers:

Answer 0 (score: 1):

  1. A UDF is not needed; both functions (substring and trim) are available built-in.
  2. The syntax of withColumn is incorrect. [Hint: check the documentation]
  3. You can simply use withColumn to replace the postcode column in place, rather than selecting the whole column list, as shown below.

OrdersRAW.show
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|  postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|   LU7 4QT|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|   BS3 5DH|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|C015939958|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|  RG41 2AN|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|   RG5 4JB|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|   HD1 2QE|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+

val ordersNew = OrdersRAW.withColumn("postcode", trim(substring($"postcode", 0, 4)))

scala> ordersNew.show
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|  CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
|       DPD UK|               260432|      1.5503E+13|       JO01974834|    1/14/2013|     LU7|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015800315|   12/31/2012|     BS3|            PARCEL|            NXTDAY|
|          6ET|               PARCEL|   NXTDAY DPD UK|           260364|  1.55011E+13|    C015|         1/21/2013|               SW6|
|           UK|               260363|     1.55012E+13|       C015854701|    1/10/2013|    RG41|            PARCEL|            NXTDAY|
|       DPD UK|               260364|     1.55011E+13|       C015945032|    1/22/2013|     RG5|            PARCEL|                  |
|NXTDAY DPD UK|               260268|     1.55017E+13|           967819|    1/11/2013|     HD1|            PARCEL|                  |
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
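A small follow-up sketch, not part of the original answer: if you would rather keep the original postcode column and add the prefix alongside it, the same built-in functions work under a new column name (postcodePrefix and ordersWithPrefix below are hypothetical names). Note that Spark's substring(str, pos, len) is 1-based, with pos 0 treated the same as pos 1, so substring(col("postcode"), 1, 4) is the more conventional call:

 import org.apache.spark.sql.functions.{col, substring, trim}

 // Add the trimmed first four characters of the postcode as a new
 // column, leaving the original postcode column untouched.
 // "postcodePrefix" is a hypothetical column name, not from the answer.
 val ordersWithPrefix = OrdersRAW.withColumn(
   "postcodePrefix",
   trim(substring(col("postcode"), 1, 4))
 )
 ordersWithPrefix.show()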

Answer 1 (score: 0):

You can use a Spark UDF like this:

 import org.apache.spark.sql.functions._

 // s.take(4) will not throw on postcodes shorter than four characters,
 // and the null check avoids an NPE on missing values.
 val postcodePrefix = udf((s: String) => if (s == null) null else s.take(4).trim)
 OrdersRAW.withColumn("newColumnName", postcodePrefix(col("postcode")))
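One caveat, offered as a sketch rather than a correction to the answer: taking a fixed four-character prefix works for the sample's three- and four-character outward codes only because trim drops the trailing space, and a two-character outward code (a hypothetical "B1 2AB", say) would come back as "B1 2". Splitting on whitespace and taking the first token avoids the fixed width, with no UDF required; ordersBySplit and postcodePrefix below are hypothetical names:

 import org.apache.spark.sql.functions.{col, split, trim}

 // The first part of the postcode is everything before the first run of
 // whitespace, so trim the outer spaces and split on \s+.
 val ordersBySplit = OrdersRAW.withColumn(
   "postcodePrefix",
   split(trim(col("postcode")), "\\s+").getItem(0)
 )
 ordersBySplit.show()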