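I'm new to PySpark. I'm trying to remove the whitespace from the column names and then convert the date strings to a DateTime format, but the conversion isn't working. Please help me with how to do this.
What I tried: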
emp=spark.read.csv("Downloads/dataset2/employees.csv",header=True)
dd=list(map(lambda x: x.replace(" ",""),emp.columns))
df=emp.toDF(*dd)
+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
|EmployeeID| LastName| FirstName| Title|TitleOfCourtesy| BirthDate| HireDate| Address| City| Region| PostalCode| Country| HomePhone| Extension| Photo| Notes| ReportsTo| PhotoPath|
+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
| 1|'Davolio'| 'Nancy'|'Sales Representa...| 'Ms.'|'1948-12-08 00:00...|'1992-05-01 00:00...|'507 - 20th Ave. ...|'Seattle'| 'WA'| '98122'| 'USA'|'(206) 555-9857'| '5467'|'0x151C2F00020000...|'Education includ...| 2|'http://accweb/em...|
| 2| 'Fuller'| 'Andrew'|'Vice President S...| 'Dr.'|'1952-02-19 00:00...|'1992-08-14 00:00...|'908 W. Capital Way'| 'Tacoma'| 'WA'| '98401'| 'USA'|'(206) 555-9482'| '3457'|'0x151C2F00020000...|'Andrew received ...| NULL|'http://accweb/em...|
+----------+---------+-----------+--------------------+---------------+--------------------+--------------------+--------------------+---------+-------+-----------+--------+----------------+----------+--------------------+--------------------+----------+--------------------+
After that I tried this, but it shows an error:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
emp.select("BirthDate").show()
Py4JJavaError: An error occurred while calling o197.select.
: org.apache.spark.sql.AnalysisException: cannot resolve '`BirthDate`' given input columns: [ PhotoPath, EmployeeID, Photo, City, HomePhone, ReportsTo, PostalCode, Title, Address, Notes, LastName, FirstName, HireDate, Region, Extension, Country, BirthDate, TitleOfCourtesy];;
After that, I tried this:
df=emp.withColumn('BirthDate', from_unixtime(unix_timestamp('BirthDate','yyyy-mm-dd')))
But it shows null values:
df.select("BirthDate").show(4)
+---------+
|BirthDate|
+---------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+---------+
Answer 0 (score: 0)
Try:
for each in df.columns:
    df = df.withColumnRenamed(each, each.strip())
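The `AnalysisException` in the question appears to come from calling `select` on `emp`, which still carries the padded header names; DataFrames are immutable, so the cleaned names exist only on the new DataFrame. A minimal sketch of the same cleanup done in one pass with `toDF`, assuming the padding is plain leading or trailing whitespace (the helper names `strip_column_names` and `with_clean_columns` are made up for illustration):

```python
def strip_column_names(columns):
    """Return the column names with leading/trailing whitespace removed."""
    return [name.strip() for name in columns]


def with_clean_columns(frame):
    """Return a new DataFrame whose column names are stripped.

    `frame` is assumed to be a pyspark.sql.DataFrame. The input frame
    is left unchanged and keeps its padded names.
    """
    return frame.toDF(*strip_column_names(frame.columns))
```

Usage would look like `df = with_clean_columns(emp)` followed by `df.select("BirthDate").show()`, selecting from `df` rather than `emp`.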
For the date:
# 'MM' is the month; 'mm' would be read as minutes
df = df.withColumn('BirthDate', from_unixtime(unix_timestamp('BirthDate', 'yyyy-MM-dd')))
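Two things visible in the question may explain the nulls: in Spark's date patterns `mm` is minute-of-hour while `MM` is month, and the values shown are wrapped in literal single quotes (`'1948-12-08 00:00...`), which a `yyyy-MM-dd` pattern does not match. A sketch, assuming every value holds the date in its first ten characters once the quotes are removed (`clean_date_text`, `parse_date_text`, and `parse_quoted_date` are illustrative names, not part of PySpark):

```python
from datetime import datetime


def clean_date_text(raw):
    """Plain-Python view of the cleanup: drop quotes, trim, keep yyyy-MM-dd."""
    return raw.replace("'", "").strip()[:10]


def parse_date_text(raw):
    """Parse one quoted value into a datetime.date (for checking a sample)."""
    return datetime.strptime(clean_date_text(raw), "%Y-%m-%d").date()


def parse_quoted_date(column_name):
    """Build a Spark column expression that applies the same steps.

    pyspark is imported here so the plain-Python helpers above can be
    tried without a Spark installation.
    """
    from pyspark.sql import functions as F

    unquoted = F.regexp_replace(F.col(column_name), "'", "")
    date_part = F.substring(F.trim(unquoted), 1, 10)
    # 'MM' is month-of-year; 'mm' would mean minute-of-hour
    return F.to_date(date_part, "yyyy-MM-dd")
```

It could be applied as `df = df.withColumn("BirthDate", parse_quoted_date("BirthDate"))`, which yields a `DateType` column; swapping `F.to_date` for `F.to_timestamp` should give a timestamp instead if that is what is needed.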