I have a CSV file named sampleOrder.csv which looks like this:
CarrierName,CarrierCustomerNumber,CarrierReference,CustomerReference,TransportDate,postcode,ProductDescription,ServiceDescription
DPD UK,260432,1.5503E+13,JO01974834,1/14/2013,LU7 4QT,PARCEL,NXTDAY
DPD UK,260364,1.55011E+13,C015800315,12/31/2012,BS3 5DH,PARCEL,NXTDAY
DPD UK,260268, 15501675752897R,953902,1/15/2013,CV10 7RL,REVERSE
IT,NXTDAY DPD UK,260162,1.55017E+13,C015889556,1/14/2013,IP13
6ET,PARCEL, NXTDAY DPD UK,260364,1.55011E+13,C015939958,1/21/2013,SW6
7JY,PARCEL, NXTDAY DPD
UK,260363,1.55012E+13,C015854701,1/10/2013,RG41 2AN,PARCEL, NXTDAY
DPD UK,260364,1.55011E+13,C015945032,1/22/2013,RG5 4JB,PARCEL,
NXTDAY DPD UK,260268,1.55017E+13,967819,1/11/2013, HD1 2QE,PARCEL,
NXTDAY DPD UK,260364,1.55011E+13,C015966537,1/24/2013,ST1 6SL,HOME
DELIVERY,AFNOON DPD UK,260364,
15500557912288R,C015821652,1/4/2013,CV10 7RL,SWAPIT,NXTDAY
I created a Spark SQL context and I load the CSV file into a dataframe like this:
val OrdersRAW = spark.read
.format("csv")
.option("header", "true")
.option("mode", "DROPMALFORMED")
.csv("Order_201301.csv")
I now want to load all the columns in the file, extract just the first part of the postcode, and populate it into another column. This is the part I am struggling with:
val ordersNew = OrdersRAW.select("CarrierName","CarrierCustomerNumber","CarrierReference","CustomerReference","TransportDate","postcode".substring(0,4).trim(),"ProductDescription","ServiceDescription")
Any ideas on how to achieve this? Thanks in advance for your help. I am using Spark 2.0+.
Answer 0 (score: 1)
Your syntax is not correct. [Hint: check the docs.] Use withColumn and replace the postcode column, rather than selecting the whole list of columns. Here is the dataframe as loaded:

scala> OrdersRAW.show
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
| CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate| postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
| DPD UK| 260432| 1.5503E+13| JO01974834| 1/14/2013| LU7 4QT| PARCEL| NXTDAY|
| DPD UK| 260364| 1.55011E+13| C015800315| 12/31/2012| BS3 5DH| PARCEL| NXTDAY|
| 6ET| PARCEL| NXTDAY DPD UK| 260364| 1.55011E+13|C015939958| 1/21/2013| SW6|
| UK| 260363| 1.55012E+13| C015854701| 1/10/2013| RG41 2AN| PARCEL| NXTDAY|
| DPD UK| 260364| 1.55011E+13| C015945032| 1/22/2013| RG5 4JB| PARCEL| |
|NXTDAY DPD UK| 260268| 1.55017E+13| 967819| 1/11/2013| HD1 2QE| PARCEL| |
+-------------+---------------------+----------------+-----------------+-------------+----------+------------------+------------------+
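Now overwrite postcode in place, keeping every other column as it is: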
import org.apache.spark.sql.functions.{substring, trim}
val ordersNew = OrdersRAW.withColumn("postcode", trim(substring($"postcode", 0, 4)))
scala> ordersNew.show
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
| CarrierName|CarrierCustomerNumber|CarrierReference|CustomerReference|TransportDate|postcode|ProductDescription|ServiceDescription|
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
| DPD UK| 260432| 1.5503E+13| JO01974834| 1/14/2013| LU7| PARCEL| NXTDAY|
| DPD UK| 260364| 1.55011E+13| C015800315| 12/31/2012| BS3| PARCEL| NXTDAY|
| 6ET| PARCEL| NXTDAY DPD UK| 260364| 1.55011E+13| C015| 1/21/2013| SW6|
| UK| 260363| 1.55012E+13| C015854701| 1/10/2013| RG41| PARCEL| NXTDAY|
| DPD UK| 260364| 1.55011E+13| C015945032| 1/22/2013| RG5| PARCEL| |
|NXTDAY DPD UK| 260268| 1.55017E+13| 967819| 1/11/2013| HD1| PARCEL| |
+-------------+---------------------+----------------+-----------------+-------------+--------+------------------+------------------+
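Note that UK outward codes vary between two and four characters, so a fixed substring(0, 4) grabs at most the first four characters and relies on trim to drop a trailing space. If you want the whole first token regardless of its length, a sketch using the built-in split function (the column name postcodeArea is illustrative, not part of the original question):

import org.apache.spark.sql.functions.{split, trim}

// Trim first (some postcodes carry a leading space), then keep the text before the first space
val ordersArea = OrdersRAW.withColumn("postcodeArea", split(trim($"postcode"), " ").getItem(0))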
Answer 1 (score: 0)
You can use a Spark UDF like this:
import org.apache.spark.sql.functions._

// UDF that takes the first four characters of the postcode and trims whitespace
val postcodePrefix = udf((s: String) => s.substring(0, 4).trim())

// Append the result as a new column, leaving the original postcode intact
OrdersRAW.withColumn("newColumnName", postcodePrefix(col("postcode")))