I am currently working in PySpark and know very little about it. My DataFrame looks like this:
id dob var1
1 13-02-1976 aab@dfsfs
2 01-04-2000 bb@NAm
3 28-11-1979 adam11@kjfd
4 30-01-1955 rehan42@ggg
The output I want is:
id dob var1 age var2
1 13-02-1976 aab@dfsfs 43 aab
2 01-04-2000 bb@NAm 19 bb
3 28-11-1979 adam11@kjfd 39 adam11
4 30-01-1955 rehan42@ggg 64 rehan42
What I have done so far:
df= df.select( df.id.cast('int').alias('id'),
df.dob.cast('date').alias('dob'),
df.var1.cast('string').alias('var1'))
But I think dob is not being converted correctly.
df= df.withColumn('age', F.datediff(F.current_date(), df.dob))
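Answer 0 (score: 0)
As you said, casting the dob column directly is not correct: the values are dd-MM-yyyy strings, and cast('date') expects yyyy-MM-dd, so the cast yields null. Please try this instead.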
from pyspark.sql.functions import col, unix_timestamp, to_date
import pyspark.sql.functions as F
# Parse the dd-MM-yyyy strings into a proper date column
df2 = df.withColumn('date_in_dateFormat', to_date(unix_timestamp(F.col('dob'), 'dd-MM-yyyy').cast('timestamp')))
df2.show()
+---+----------+-----------+------------------+
| id| dob| var1|date_in_dateFormat|
+---+----------+-----------+------------------+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|
| 2|01-04-2000| bb@NAm| 2000-04-01|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|
+---+----------+-----------+------------------+
df2.printSchema()
root
|-- id: integer (nullable = true)
|-- dob: string (nullable = true)
|-- var1: string (nullable = true)
|-- date_in_dateFormat: date (nullable = true)
# datediff returns the difference in days, so 'age' here is a day count, not years
df3 = df2.withColumn('age', F.datediff(F.current_date(), df2.date_in_dateFormat))
df3.show()
+---+----------+-----------+------------------+-----+
| id| dob| var1|date_in_dateFormat| age|
+---+----------+-----------+------------------+-----+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|15789|
| 2|01-04-2000| bb@NAm| 2000-04-01| 6975|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|14405|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|23473|
+---+----------+-----------+------------------+-----+
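The age column above is a day count, while the question asks for age in whole years. As a sanity check of the arithmetic, here is a small plain-Python sketch (no Spark involved). The helper name age_in_years is hypothetical, and the reference date 2019-05-07 is an assumption inferred from the day counts shown above rather than something stated in the thread.

```python
from datetime import date


def age_in_years(dob: date, today: date) -> int:
    """Whole years elapsed from dob to today (hypothetical helper)."""
    years = today.year - dob.year
    # Subtract one if this year's birthday has not happened yet
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years


# Assumed run date, inferred from the day counts in the answer's output
reference = date(2019, 5, 7)
dobs = [date(1976, 2, 13), date(2000, 4, 1), date(1979, 11, 28), date(1955, 1, 30)]

day_counts = [(reference - d).days for d in dobs]
ages = [age_in_years(d, reference) for d in dobs]

print(day_counts)  # [15789, 6975, 14405, 23473]
print(ages)        # [43, 19, 39, 64]
```

With that reference date, the day counts match the age column above and the whole-year values match the ages requested in the question.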
# Split var1 on '@' and keep the part before it
split_col = F.split(df3['var1'], '@')
df4 = df3.withColumn('Var2', split_col.getItem(0))
df4.show()
+---+----------+-----------+------------------+-----+-------+
| id| dob| var1|date_in_dateFormat| age| Var2|
+---+----------+-----------+------------------+-----+-------+
| 1|13-02-1976| aab@dfsfs| 1976-02-13|15789| aab|
| 2|01-04-2000| bb@NAm| 2000-04-01| 6975| bb|
| 3|28-11-1979|adam11@kjfd| 1979-11-28|14405| adam11|
| 4|30-01-1955|rehan42@ggg| 1955-01-30|23473|rehan42|
+---+----------+-----------+------------------+-----+-------+