How to loop over multiple existing pyspark dataframes and add a column containing a calculation

Time: 2019-08-15 20:09:00

Tags: pyspark

I have multiple pyspark dataframes that already exist. I need to add a new column to each of them. I can "hard-code" the solution and it works fine, but when I try to add the column to all the dataframes with a for loop, I get an error. The error says that "withColumn" is not an attribute of the dataframe. I do not understand this error.

What I typed below is what I have tried. I am new to Python and pyspark.

Import the pyspark class Row from the sql module

from pyspark.sql import *
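
Note that the code below calls spark.createDataFrame(...) without creating spark first; that variable is provided automatically in the pyspark shell and in most notebooks. A minimal sketch for running the same code as a standalone script (the app name "bmi-example" is an arbitrary choice):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; only needed outside the pyspark
# shell, where the spark variable already exists.
spark = SparkSession.builder.appName("bmi-example").getOrCreate()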

Create sample data

----> create some data for the first dataframe

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Sam', gender='M', height='77', age='42', weight='190')
input2 = Row(firstname='Diane', gender='F', height='70', age='21', weight='110')
input3 = Row(firstname='Norm', gender='M', height='68', age='33', weight='240')
input4 = Row(firstname='Carla', gender='F', height='60', age='29', weight='90')

----> create a spark dataframe for the first input

bios1 = Row(input1, input2, input3, input4)
bios1_sdf = spark.createDataFrame(bios1)
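
Aside (not in the original post): every value above is supplied as a string, so Spark infers string columns; the BMI arithmetic further down still works because Spark casts the strings to doubles implicitly inside numeric expressions. A quick way to confirm the inferred types:

# All five columns are inferred as string because the Row values are
# strings; Spark casts them implicitly when they appear in arithmetic.
bios1_sdf.printSchema()

The columns also come out in alphabetical order (age, firstname, gender, height, weight), because Row sorts keyword fields by name in pyspark versions before 3.0.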

----> create more data

Labels = Row("firstName", "gender", "height", "age", "weight")
input1 = Row(firstname='Chandler', gender='M', height='72', age='109', weight='270')
input2 = Row(firstname='Monica', gender='F', height='64', age='10', weight='123')
input3 = Row(firstname='Ross', gender='M', height='74', age='59', weight='168')
input4 = Row(firstname='Phoebe', gender='F', height='64', age='2', weight='20')

----> create a spark dataframe for the second input

bios2 = Row(input1, input2, input3, input4)
bios2_sdf = spark.createDataFrame(bios2)

Hard-coded approach

----> add bmi to dataframe one

bios1_sdf = bios1_sdf.withColumn('bmi',(bios1_sdf.weight/(bios1_sdf.height * bios1_sdf.height)*703))
print("data in bios1_sdf")
bios1_sdf.show()

----> add bmi to dataframe two

bios2_sdf = bios2_sdf.withColumn('bmi',(bios2_sdf.weight/(bios2_sdf.height * bios2_sdf.height)*703))
print("data in bios2_sdf")
bios2_sdf.show()
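
For reference (this is not in the original post), the same expression can be written with pyspark.sql.functions.col, which resolves column names against whatever dataframe withColumn is called on, so the expression never has to name the dataframe variable:

from pyspark.sql.functions import col

# Equivalent to the hard-coded line above: col('weight') refers to the
# column of the dataframe that withColumn is invoked on.
bios2_sdf = bios2_sdf.withColumn('bmi', col('weight') / (col('height') * col('height')) * 703)

This form is handy in loops, since the same expression works unchanged for every dataframe.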

------> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

------> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+

Dynamic coding approach

----> initialize dictionary

dict_of_df ={}

----> list of the pyspark dataframes that need the column added

list_of_sdf = [bios1_sdf, bios2_sdf]



for i in range(1, 2+1):
    # name of dataframe
    key_name_in = 'bios' + str(i) + '_sdf'
    dict_of_df[key_name_in] = list_of_sdf[i-1]
    temp_sdf = dict_of_df[key_name_in]

    # add bmi
    dict_of_df[key_name_in] = temp_sdf.withcolumn('bmi', sdf.weight/(temp_sdf.height*temp_sdf.height)*703)

Error

I expected to get these results, the same as in my "hard-coded" example above.

-----> data in bios1_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
| 42|      Sam|     M|    77|   190| 22.52825096980941|
| 21|    Diane|     F|    70|   110|15.781632653061223|
| 33|     Norm|     M|    68|   240|  36.4878892733564|
| 29|    Carla|     F|    60|    90|            17.575|
+---+---------+------+------+------+------------------+

-----> data in bios2_sdf

+---+---------+------+------+------+------------------+
|age|firstname|gender|height|weight|               bmi|
+---+---------+------+------+------+------------------+
|109| Chandler|     M|    72|   270|36.614583333333336|
| 10|   Monica|     F|    64|   123|   21.110595703125|
| 59|     Ross|     M|    74|   168| 21.56756756756757|
|  2|   Phoebe|     F|    64|    20|      3.4326171875|
+---+---------+------+------+------+------------------+
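
For reference, the AttributeError almost certainly comes from two typos in the loop rather than from anything about storing dataframes in a list or dictionary: Python attribute names are case-sensitive, so withcolumn must be withColumn, and sdf inside the loop body is undefined (presumably temp_sdf was intended). A minimal corrected sketch of the loop, under those assumptions:

for i in range(1, 2+1):
    # name of dataframe
    key_name_in = 'bios' + str(i) + '_sdf'
    temp_sdf = list_of_sdf[i-1]

    # add bmi (withColumn, capital C; reference temp_sdf, not sdf)
    dict_of_df[key_name_in] = temp_sdf.withColumn('bmi', temp_sdf.weight / (temp_sdf.height * temp_sdf.height) * 703)

Note that withColumn returns a new dataframe (Spark dataframes are immutable), so bios1_sdf and bios2_sdf themselves are unchanged; the results live in dict_of_df. Iterating over the list directly, for example with for i, temp_sdf in enumerate(list_of_sdf, start=1), would also avoid the index arithmetic.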

0 Answers