Modify a column inside a struct field in PySpark

Asked: 2018-04-05 09:47:23

Tags: python apache-spark pyspark

I have a DataFrame (df) with the following schema:

root
 |-- AddressBook: struct (nullable = true)
 |    |-- ContactInformationsList: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- ContactId: string (nullable = true)
 |    |    |    |-- ContactMeansDesc: string (nullable = true)
 |    |    |    |-- IsPrimaryMeans: boolean (nullable = true)
 |    |    |    |-- TypeMeansContactId: string (nullable = true)
 |    |    |    |-- Value: string (nullable = true)
 |    |-- PersonData: struct (nullable = true)
 |    |    |-- BirthDate: string (nullable = true)
 |    |    |-- CSP: string (nullable = true)
 |    |    |-- Civility: string (nullable = true)
 |    |    |-- FirstName: string (nullable = true)
 |    |    |-- Gender: string (nullable = true)
 |    |    |-- LastName: string (nullable = true)
 |    |    |-- MaritalStatus: string (nullable = true)
 |    |    |-- SBirthDate: string (nullable = true)
 |    |    |-- Title: string (nullable = true)
 |-- PublicId: string (nullable = true)
 |-- Version: long (nullable = true)

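This DataFrame is built from production data, so I want to change some of the personal information. Basically, I want to replace the nested column AddressBook.PersonData.LastName with a hash of its value.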

I tried:

from pyspark.sql import functions as F

df.withColumn(
    'AddressBook.Persondata.Lastname',
    F.hash(F.col('AddressBook.Persondata.Lastname'))
)

But it just added another top-level column, with the dotted string as its literal name:

|-- AddressBook.Persondata.Lastname: int (nullable = true)

Is there a simple way to modify the nested field in place?

0 answers:

No answers yet