Is there any function that can help me convert date and string formats in PySpark?

Asked: 2019-05-07 05:36:46

Tags: python pyspark pyspark-sql

I am currently working in PySpark and have little experience with it. My DataFrame looks like this:

id       dob            var1
1       13-02-1976     aab@dfsfs
2       01-04-2000     bb@NAm
3       28-11-1979     adam11@kjfd
4       30-01-1955     rehan42@ggg

The output I want is:

id       dob            var1             age           var2
1       13-02-1976     aab@dfsfs         43            aab
2       01-04-2000     bb@NAm            19            bb
3       28-11-1979     adam11@kjfd       39            adam11
4       30-01-1955     rehan42@ggg       64            rehan42

What I have done so far:

df = df.select(df.id.cast('int').alias('id'),
               df.dob.cast('date').alias('dob'),
               df.var1.cast('string').alias('var1'))

But I don't think the dob conversion is correct.

df= df.withColumn('age', F.datediff(F.current_date(), df.dob))

1 Answer:

Answer 0: (score: 0)

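As you said, casting the dob column directly is not correct, because the strings are in dd-MM-yyyy format. Please try this: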

from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F

# Parse the 'dd-MM-yyyy' strings into a timestamp, then convert it to a date.
df2 = df.withColumn('date_in_dateFormat',
                    to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast('timestamp')))
df2.show()
+---+----------+-----------+------------------+
| id|       dob|       var1|date_in_dateFormat|
+---+----------+-----------+------------------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|
|  2|01-04-2000|     bb@NAm|        2000-04-01|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|
+---+----------+-----------+------------------+

df2.printSchema()
root
 |-- id: integer (nullable = true)
 |-- dob: string (nullable = true)
 |-- var1: string (nullable = true)
 |-- date_in_dateFormat: date (nullable = true)

# Note: datediff returns the difference in days, not years.
df3 = df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id|       dob|       var1|date_in_dateFormat|  age|
+---+----------+-----------+------------------+-----+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|
+---+----------+-----------+------------------+-----+

# Split var1 on '@' and keep the part before it.
split_col = F.split(df3['var1'], '@')
df4 = df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id|       dob|       var1|date_in_dateFormat|  age|   Var2|
+---+----------+-----------+------------------+-----+-------+
|  1|13-02-1976|  aab@dfsfs|        1976-02-13|15789|    aab|
|  2|01-04-2000|     bb@NAm|        2000-04-01| 6975|     bb|
|  3|28-11-1979|adam11@kjfd|        1979-11-28|14405| adam11|
|  4|30-01-1955|rehan42@ggg|        1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+