Conditional statements in PySpark

Asked: 2019-07-03 01:52:35

Tags: python apache-spark pyspark pyspark-sql

I have a dataframe with 90 billion transaction records. The dataframe looks like this:

id          marital_status     age    new_class_desc      is_child          
1              Married          35    kids_sec                 0
2              Single           28    Other                    1
3              Married          32    Other                    1
5              Married          42    kids_sec                 0
2              Single           28    Other                    1
7              Single           27    kids_sec                 0

I want the dataframe to look like this:

id       marital_status     age     is_child   new_class_desc    new_is_child          
1           Married          35        0       kids_sec            1
2           Single           28        0       Other               0
3           Married          32        1       Other               1
5           Married          42        0       kids_sec            1
2           Single           28        1       Other               1
7           Single           27        0       kids_sec            0

I have already done this in Python (pandas) as follows:

# Rows that should keep their original is_child value.
condition = ~((df['marital_status'] == 'Married') &
              (df['new_class_desc'] == 'kids_sec') &
              (df['age'] >= 33))

# Copy is_child into a new column, then set it to 1 wherever
# the condition is False (Series.where keeps values where True).
df['new_is_child'] = df['is_child'].where(condition, 1)
print(df)

How can I do the same with PySpark?

0 Answers:

No answers yet.