Question

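I have a data frame in pyspark. It has the columns id, name, city and country. I also have a list that contains some names. I want to add a new column called test to the data frame: Y if name is in the list, N otherwise.

I have done it as shown below. I created a function for the operation.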

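from pyspark.sql.functions import when

def new_column(df, list):
    # Flag each row with "Y" if its name is in the list, "N" otherwise
    df1 = df.withColumn('test', when(df.name.isin(list), "Y").otherwise('N'))
    return df1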

Then I call it:

new_df = new_column(df, list)

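My new_df then contains the test column, populated according to the isin condition.

Now I want to use the same function in different places in my script.

I have a list called cities that contains city names, and a list called to_visit that contains country names.

Suppose I want to create several data frames, each with a different new column, by checking different columns of the data frame.

For example, I want to do the following.

In the data frame, check the city column and populate a new column city_visited.
In the data frame, check the country column and populate a new column bucket_list.

Above I created new_df. After some transformations on it, I will have a data frame called full_df. In full_df I want to populate the city_visited column, like this.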

city_visited_df = new_column(full_df, cities)

Then, after some transformations on city_visited_df, I will have a data frame called secure_df. In secure_df I want to populate the bucket_list column, like this.

bucket_list_df = new_column(secure_df, to_visit)

Basically, what I want is to pass the function both the name of the column to add and the name of the column to check.

def new_column(df, list, column_to_add, column_to_check):
    df1 = df.withColumn('column_to_add', when(df.column_to_check.isin(list), "Y").otherwise('N'))
    return df1

Is this possible? If so, how can I do it?

Edit based on @pault's comment

def new_column(df, list, column_to_add, column_to_check):
    df1 = df.withColumn('column_to_add', when(df[column_to_check].isin(list), "Y").otherwise('N'))
    return df1

I get the following error.

NameError: name 'column_to_add' is not defined

Answer 1

You have to define the variables before you pass them to the function. The NameError most likely comes from the call site, where column_to_add is used as a name that has not been assigned yet.

column_to_add = 'secure_id'
column_to_check = 'device_model'

Your function should look like this. Note that column_to_add is passed to withColumn as a variable, not as the quoted string 'column_to_add', which would always create a column literally named column_to_add.

from pyspark.sql.functions import when

def new_column(df, list, column_to_add, column_to_check):
    df1 = df.withColumn(column_to_add, when(df[column_to_check].isin(list), "Y").otherwise('N'))
    return df1

Then call your function with those variables, and you will not get any error.
