我有多个已经存在的pyspark数据框。我需要向每个数据框添加一个新列。我可以对解决方案进行“硬编码”,并且可以正常工作。但是,当我尝试使用for循环将列添加到所有数据帧时,出现错误。该错误表明“ withColumn”不是数据框的属性。我不明白这个错误。
我在下面键入的内容是我尝试过的内容。我是python和pyspark新手。
from pyspark.sql import *
Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Sam', gender='M', height = '77',age = '42',weight = '190')
input2 = Row(firstname='Diane', gender='F', height = '70',age = '21', weight ='110')
input3 = Row(firstname='Norm',gender='M', height = '68',age = '33',weight = '240')
input4 = Row(firstname='Carla', gender='F', height = '60',age = '29',weight = '90')
bios1 = Row(input1, input2, input3, input4)
bios1_sdf = spark.createDataFrame(bios1)
Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Chandler', gender='M', height = '72',age = '109',weight = '270')
input2 = Row(firstname='Monica', gender='F', height = '64',age = '10', weight ='123')
input3 = Row(firstname='Ross',gender='M', height = '74',age = '59',weight = '168')
input4 = Row(firstname='Phoebe', gender='F', height = '64',age = '2',weight = '20')
bios2 = Row(input1, input2, input3, input4)
bios2_sdf = spark.createDataFrame(bios2)
bios1_sdf = bios1_sdf.withColumn('bmi',(bios1_sdf.weight/(bios1_sdf.height * bios1_sdf.height)*703))
print("data in bios1_sdf")
bios1_sdf.show()
bios2_sdf = bios2_sdf.withColumn('bmi',(bios2_sdf.weight/(bios2_sdf.height * bios2_sdf.height)*703))
print("data in bios2_sdf")
bios2_sdf.show()
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
| 42| Sam| M| 77| 190| 22.52825096980941|
| 21| Diane| F| 70| 110|15.781632653061223|
| 33| Norm| M| 68| 240| 36.4878892733564|
| 29| Carla| F| 60| 90| 17.575|
+---+---------+------+------+------+------------------+
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
|109| Chandler| M| 72| 270|36.614583333333336|
| 10| Monica| F| 64| 123| 21.110595703125|
| 59| Ross| M| 74| 168| 21.56756756756757|
| 2| Phoebe| F| 64| 20| 3.4326171875|
+---+---------+------+------+------+------------------+
dict_of_df ={}
list_of_sdf = [bios1_sdf, bios2_sdf]
for i in range(1,2+1):
# name of dataframe
key_name_in = 'bios'+str(i)+'_sdf'
dict_of_df[key_name_in] = list_of_sdf[i-1]
temp_sdf = dict_of_df[key_name_in]
# add bmi
dict_of_df[key_name_in] = temp_sdf.withcolumn('bmi',sdf.weight/(temp_sdf.height*temp_sdf.height)*703)
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
| 42| Sam| M| 77| 190| 22.52825096980941|
| 21| Diane| F| 70| 110|15.781632653061223|
| 33| Norm| M| 68| 240| 36.4878892733564|
| 29| Carla| F| 60| 90| 17.575|
+---+---------+------+------+------+------------------+
+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight| bmi|
+---+---------+------+------+------+------------------+
|109| Chandler| M| 72| 270|36.614583333333336|
| 10| Monica| F| 64| 123| 21.110595703125|
| 59| Ross| M| 74| 168| 21.56756756756757|
| 2| Phoebe| F| 64| 20| 3.4326171875|
+---+---------+------+------+------+------------------+