How do I get a new column with PySpark?

Date: 2018-01-29 05:10:52

Tags: python pyspark


I want to derive column C4 using a formula. For example, C4 should be computed when c1 = '104001'.

1 answer:

Answer 0 (score: 0)

You can add another column like this:

from pyspark.sql import Row
from pyspark import SparkContext, SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Sample data: (var1, var2) pairs
l = [(25, 24), (23, 45), (24, 56)]
rdd = sc.parallelize(l)
dummy = rdd.map(lambda x: Row(var1=int(x[0]), var2=int(x[1])))
dummyframe = sqlContext.createDataFrame(dummy)


def getValDivideSum(dataFrame):
    # Sum of var2 over the whole frame, collected to the driver as a scalar
    total = dataFrame.agg({"var2": "sum"}).collect()[0][0]
    # var3 is each row's var2 as a fraction of that total
    dataFrame = dataFrame.withColumn("var3", dataFrame.var2 / total).select("var1", "var2", "var3")
    return dataFrame


getValDivideSum(dummyframe).show()

The output will look like this:

+----+----+-----+
|var1|var2| var3|
+----+----+-----+
|  25|  24|0.192|
|  23|  45| 0.36|
|  24|  56|0.448|
+----+----+-----+
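
As a sanity check, the var3 values can be reproduced in plain Python without Spark: each one is the row's var2 divided by the sum of the var2 column (24 + 45 + 56 = 125). This is only a minimal sketch of the arithmetic. The names `rows`, `total` and `result` are chosen here for illustration and are not part of the code above.

```python
# Plain-Python check of the var3 column; no Spark needed.
# rows, total and result are illustrative names, not from the original answer.
rows = [(25, 24), (23, 45), (24, 56)]  # the same (var1, var2) pairs

# Column total of var2: 24 + 45 + 56 = 125
total = sum(var2 for _, var2 in rows)

# Each row's var3 is its var2 divided by the column total.
result = [(var1, var2, var2 / total) for var1, var2 in rows]

for var1, var2, var3 in result:
    print(var1, var2, round(var3, 3))
```

Because every var3 is a share of the same total, the column should sum to 1, which is a quick way to confirm the new column was computed correctly.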

Hope this helps.