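I have a dataframe with several columns, and I wrote a code snippet that generates a new column, 'label', holding a string that summarizes the values of the other columns.
Suppose I have the following dataframe 'df':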
+------+------+
| Atr0 | Atr1 |
+------+------+
| 0    | 1    |
+------+------+
| 0    | 2    |
+------+------+
| 0    | 8    |
+------+------+
| ...  | ...  |
+------+------+
I want to create a new dataframe 'df' that looks like this:
+------+------+--------------------+
| Atr0 | Atr1 | Atr2               |
+------+------+--------------------+
| 0    | 1    | Atr0='0', Atr1='1' |
+------+------+--------------------+
| 0    | 2    | Atr0='0', Atr1='2' |
+------+------+--------------------+
| 0    | 8    | Atr0='0', Atr1='8' |
+------+------+--------------------+
| ...  | ...  | ...                |
+------+------+--------------------+
If I run the following snippet, it works:
df = df.withColumn('Atr2', concat(lit("Atr0="), col('Atr0'), lit(", Atr1="), col('Atr1')))
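But I don't want the code to depend on the user. I have a variable 'list' that holds the dataframe's column names, and I want to iterate over that list to generate the new column automatically. Here is the code I wrote: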
def addColumn(list):
    # Many operations to create the dataframe 'df'
    list = list + ['AtrX']
    # list is a variable containing ['Atr0', 'Atr1', 'AtrX']
    label = "df = df.withColumn('label', concat("
    count_aux = 0
    for atributo_aux in list:
        if count_aux == 0:
            label = label + "lit('" + atributo_aux + "='), col('" + atributo_aux + "')"
        else:
            label = label + ", lit('," + atributo_aux + "='), col('" + atributo_aux + "')"
        count_aux += 1
    label = label + "))"
    print(label)
    exec(label)
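But when I run the function, the dataframe is never updated. I checked that the string is generated correctly. Why does the dataframe update when I write the column names by hand, but not when I run the generated code above?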
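For what it's worth, the likely cause can be reproduced without Spark. The sketch below assumes Python 3, where exec called inside a function writes its assignments to a temporary namespace and does not rebind the function's local variables. The list stored in df and the string stored in label are stand-ins for the real DataFrame and the generated withColumn call, and both function names are made up for the illustration.

```python
def add_column_with_exec():
    df = ["Atr0", "Atr1"]          # stand-in for the real DataFrame (assumption)
    label = "df = df + ['label']"  # stand-in for the generated withColumn call
    # The assignment inside the string lands in a temporary namespace,
    # so the local variable df of this function is left untouched.
    exec(label)
    return df


def add_column_with_namespace():
    df = ["Atr0", "Atr1"]
    label = "df = df + ['label']"
    # Passing an explicit dict makes the result of the assignment readable
    # after exec has finished.
    namespace = {"df": df}
    exec(label, namespace)
    return namespace["df"]


print(add_column_with_exec())       # ['Atr0', 'Atr1']
print(add_column_with_namespace())  # ['Atr0', 'Atr1', 'label']
```

If this is indeed what is happening, exec may not be needed at all: concat accepts any number of column arguments, so the lit(...) and col(...) pieces could be collected in a Python list called, say, parts and passed as concat(*parts).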