How to remove whitespace from pyspark column headers and how to convert a string date to datetime format

Posted: 2019-10-25 07:55:25

Tags: pyspark pyspark-sql pyspark-dataframes

I am new to pyspark. I am trying to remove the whitespace from the column headers and then convert a string-typed date column to DateTime format, but the conversion does not work. Please help me with how to do this.

What I have tried:

emp = spark.read.csv("Downloads/dataset2/employees.csv", header=True)
# Strip the spaces from every column name and build a new DataFrame with the cleaned names
dd = list(map(lambda x: x.replace(" ", ""), emp.columns))
df = emp.toDF(*dd)


+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
|EmployeeID| LastName|  FirstName|               Title|TitleOfCourtesy|           BirthDate|            HireDate|             Address|     City| Region| PostalCode| Country|       HomePhone| Extension|               Photo|               Notes| ReportsTo|           PhotoPath|
+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
|         1|'Davolio'|    'Nancy'|'Sales Representa...|          'Ms.'|'1948-12-08 00:00...|'1992-05-01 00:00...|'507 - 20th Ave. ...|'Seattle'|   'WA'|    '98122'|   'USA'|'(206) 555-9857'|    '5467'|'0x151C2F00020000...|'Education includ...|         2|'http://accweb/em...|
|         2| 'Fuller'|   'Andrew'|'Vice President S...|          'Dr.'|'1952-02-19 00:00...|'1992-08-14 00:00...|'908 W. Capital Way'| 'Tacoma'|   'WA'|    '98401'|   'USA'|'(206) 555-9482'|    '3457'|'0x151C2F00020000...|'Andrew received ...|      NULL|'http://accweb/em...|
+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
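
The rename only affects `df`; `emp` still carries the original space-prefixed headers. A quick check (a sketch, assuming the two frames built above):

print(emp.columns)   # still contains names such as ' LastName' with a leading space
print(df.columns)    # cleaned names: 'LastName', 'FirstName', ...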

After that I tried this, but it shows an error:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

emp.select("BirthDate").show()
Py4JJavaError: An error occurred while calling o197.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`BirthDate`' given input columns: [ PhotoPath, EmployeeID,  Photo,  City,  HomePhone,  ReportsTo,  PostalCode,  Title,  Address, Notes,  LastName,   FirstName,  HireDate,  Region,  Extension,  Country,  BirthDate, TitleOfCourtesy];;
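
The column list in the error shows the actual problem: `select` was called on `emp`, whose headers still start with a space (' BirthDate'). Selecting from the renamed frame, or addressing the column by its exact original name, resolves it (a sketch based on the frames above):

df.select("BirthDate").show(2)           # the renamed frame resolves the clean name
emp.select(emp[" BirthDate"]).show(2)    # or reference the original name, leading space included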

After that, I tried this:

df=emp.withColumn('BirthDate', from_unixtime(unix_timestamp('BirthDate','yyyy-mm-dd')))

But it shows null values:

df.select("BirthDate").show(4)
+---------+
|BirthDate|
+---------+
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
|     null|
+---------+
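
The nulls come from a pattern mismatch: the raw strings still carry the single quotes from the CSV (plus a time portion), and 'yyyy-mm-dd' uses mm, which means minutes; months are MM. Inspecting the raw values makes this visible (a diagnostic sketch, assuming the df above):

# Show the untruncated strings: they look like '1948-12-08 00:00... with the quotes included,
# so a bare yyyy-MM-dd pattern cannot match them as-is.
df.select('BirthDate').show(4, truncate=False)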

1 Answer:

Answer 0 (score: 0)

Try:

# Rename each column, stripping leading/trailing whitespace from its name
for each in df.columns:
    df = df.withColumnRenamed(each, each.strip())
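
The same cleanup can also be done in one pass with a select and aliases (an equivalent alternative, assuming `emp` is the frame read above):

df = emp.select([emp[c].alias(c.strip()) for c in emp.columns])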

For the date:

df = df.withColumn('BirthDate', from_unixtime(unix_timestamp('BirthDate', 'yyyy-MM-dd')))
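
Even with the rename in place, this still returns null as long as the values keep their single quotes, as in the output shown in the question. A fuller sketch (same hypothetical file path as in the question): let the CSV reader strip the quotes, clean the headers, then parse the date part of the string:

from pyspark.sql.functions import substring, to_timestamp

# Re-read with quote="'" so the reader removes the single quotes around each value.
emp = spark.read.csv("Downloads/dataset2/employees.csv", header=True, quote="'")
df = emp.toDF(*[c.strip() for c in emp.columns])

# Keep only the yyyy-MM-dd part of the string and parse it into a timestamp.
df = df.withColumn('BirthDate', to_timestamp(substring('BirthDate', 1, 10), 'yyyy-MM-dd'))
df.select('BirthDate').show(4)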