Answer (score: 0)
If I haven't misunderstood you, I believe this is what you are looking for:
import pyspark.sql.functions as f
from pyspark.sql.types import DateType
from datetime import datetime
# col1 has date format DDMMYYYY and col2 has date format MMDDYYYY
df = sc.parallelize([('30082017','08272017'), ('29082017','08262017')]).toDF(["col1", "col2"])
f_mmdd = f.udf(lambda x: datetime.strptime(x, '%m%d%Y'), DateType())  # parses MMDDYYYY strings
f_ddmm = f.udf(lambda x: datetime.strptime(x, '%d%m%Y'), DateType())  # parses DDMMYYYY strings
df = df.withColumn("col1_date_ddmm",f_ddmm(df.col1)).withColumn("col2_date_mmdd",f_mmdd(df.col2))
df.show()
The output is:
+--------+--------+--------------+--------------+
| col1| col2|col1_date_ddmm|col2_date_mmdd|
+--------+--------+--------------+--------------+
|30082017|08272017| 2017-08-30| 2017-08-27|
|29082017|08262017| 2017-08-29| 2017-08-26|
+--------+--------+--------------+--------------+
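To see what the two UDFs compute per row, the same parsing can be done in plain Python with `datetime.strptime`; the format codes are `%d` (day), `%m` (month), and `%Y` (four-digit year), and the resulting `date` objects print in the ISO `YYYY-MM-DD` form shown in the table above:

```python
from datetime import datetime

# Per-row behavior of the UDFs above, using the first row of the DataFrame:
ddmm = datetime.strptime('30082017', '%d%m%Y').date()  # col1 is DDMMYYYY
mmdd = datetime.strptime('08272017', '%m%d%Y').date()  # col2 is MMDDYYYY
print(ddmm, mmdd)  # → 2017-08-30 2017-08-27
```

As an aside, on Spark 2.2 and later the built-in `f.to_date(df.col1, 'ddMMyyyy')` should achieve the same result without a Python UDF (Spark uses Java-style date patterns, so `dd`/`MM`/`yyyy` rather than `%d`/`%m`/`%Y`), which avoids the serialization overhead of UDFs.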
Hope this helps!