Question

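I have a data frame with several columns, and I wrote a code snippet that generates a new column 'label': a string containing the information from the other columns.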

Suppose I have the following data frame 'df':

+------------+------+
| Atr0       | Atr1 |
+------------+------+
|     0      |  1   |
+------------+------+
|     0      |  2   |
+------------+------+
|     0      |  8   |
+------------+------+
|     ...    |  ... |
+------------+------+

So I want to create a new data frame 'df' that looks like this:

+------------+------+-------------------+
| Atr0       | Atr1 |        Atr2       |  
+------------+------+-------------------+
|     0      |  1   | Atr0='0', Atr1='1'|
+------------+------+-------------------+
|     0      |  2   | Atr0='0', Atr1='2'|
+------------+------+-------------------+
|     0      |  8   | Atr0='0', Atr1='8'|
+------------+------+-------------------+
|     ...    |  ... |        ...        |
+------------+------+-------------------+

If I run the following snippet, it works:

from pyspark.sql.functions import col, concat, lit
df = df.withColumn('Atr2', concat(lit("Atr0="), col('Atr0'), lit(", Atr1="), col('Atr1')))

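But I want the code not to depend on the user. That is, I have a variable 'list' that contains the data frame's columns, and I want to iterate over that list to generate the new column automatically. Here is the code I wrote: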

def addColumn(list):

    # Many operations to create the dataframe 'df'
    list = list + ['AtrX']
    # list is a variable containing ['Atr0', 'Atr1', 'AtrX']
    label = "df = df.withColumn('label', concat("

    count_aux = 0
    for atributo_aux in list:
        if(count_aux == 0):
            label = label + "lit('" + atributo_aux + "='), col('" + atributo_aux + "')" 
        else:
            label = label + ", lit('," + atributo_aux + "='), col('" + atributo_aux + "')"
        count_aux += 1

    label = label + "))" 
    print(label)
    exec(label)

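But when I run the function, the data frame is never updated. I checked that the string is generated correctly. Why does the data frame get updated when I run the code manually with the column names written out, but not when I run the generated code above?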
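A likely cause, assuming this runs under Python 3: inside a function, exec() works on a snapshot of the local namespace, so an assignment such as `df = ...` inside the executed string does not rebind the function's real local variable. Separately, addColumn as shown never returns df, so the caller could not see a new data frame either way. The sketch below reproduces the scoping behavior with a plain string instead of a data frame; the function names are hypothetical and only serve the illustration.

```python
def rebind_with_exec():
    value = "original"
    # exec runs the string against a snapshot of the local namespace,
    # so this assignment does not rebind the function's real local.
    exec("value = 'updated'")
    return value


def rebind_with_namespace():
    namespace = {"value": "original"}
    # With an explicit dict as the local namespace, the assignment
    # made by the executed string can be read back afterwards.
    exec("value = 'updated'", {}, namespace)
    return namespace["value"]


print(rebind_with_exec())       # original
print(rebind_with_namespace())  # updated
```

The same thing would happen to `df` in addColumn: the executed string builds a new data frame, but the name `df` in the function body still points at the old one.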

Using exec() in pyspark to create a column for a data frame does not work
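One way to sidestep exec() entirely is to build the list of column expressions in ordinary Python and unpack it into concat(). This is a minimal sketch, not a definitive implementation: `label_pieces`, `add_label`, and the `label_col` parameter are hypothetical names, and the ", " separator follows the hand-written example rather than the generated string.

```python
def label_pieces(columns):
    """Describe the label as (kind, value) pairs: literal text or a column name."""
    pieces = []
    for i, name in enumerate(columns):
        # The first attribute has no leading separator; later ones get ", ".
        prefix = name + "=" if i == 0 else ", " + name + "="
        pieces.append(("lit", prefix))
        pieces.append(("col", name))
    return pieces


def add_label(df, columns, label_col="label"):
    # Imported here so label_pieces can be used without Spark installed.
    from pyspark.sql.functions import col, concat, lit

    exprs = [lit(value) if kind == "lit" else col(value)
             for kind, value in label_pieces(columns)]
    # withColumn returns a new DataFrame, so it has to be returned to the caller.
    return df.withColumn(label_col, concat(*exprs))
```

It would be called as `df = add_label(df, ['Atr0', 'Atr1'])`, which reassigns the result in the caller's scope instead of relying on an assignment inside an executed string.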

0 answers: