PySpark equivalent of the Scala Dataset#transform method

Date: 2017-09-15 20:52:37

Tags: apache-spark pyspark apache-spark-sql apache-spark-dataset

The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations, like so:

val weirdDf = df
  .transform(myFirstCustomTransformation)
  .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation.

Is there a PySpark way to chain custom transformations?

If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

2 answers:

Answer 0 (score: 1)

Implementation:

from pyspark.sql.dataframe import DataFrame

def transform(self, f):
    return f(self)

# Monkey patch: attach transform as a method on every DataFrame
DataFrame.transform = transform

Usage:

spark.range(1).transform(lambda df: df.selectExpr("id * 2"))
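
With that patch in place, custom transformations chain just like the Scala example in the question. A minimal sketch (the transformation functions here are hypothetical):

from pyspark.sql import functions as F

def with_doubled_id(df):
    # add a column holding id * 2
    return df.withColumn("doubled_id", F.col("id") * 2)

def with_id_plus_one(df):
    # add a column holding id + 1
    return df.withColumn("id_plus_one", F.col("id") + 1)

weird_df = (spark.range(3)
            .transform(with_doubled_id)
            .transform(with_id_plus_one))
weird_df.show()

For what it's worth, newer Spark releases (3.0+) ship pyspark.sql.DataFrame with a built-in transform method that behaves the same way, so the monkey patch is only needed on older versions.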

Answer 1 (score: 0)

A Transformer pipeline built from SQLTransformer objects (or any other Transformer) is the Spark-native way to chain transformations. For example:

from pyspark.ml.feature import SQLTransformer
from pyspark.ml import Pipeline

df = spark.createDataFrame([
    (0, 1.0, 3.0),
    (2, 2.0, 5.0)
], ["id", "v1", "v2"])
sqlTrans = SQLTransformer(
    statement="SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlSelectExpr = SQLTransformer(statement="SELECT *, (id * 2) AS v5 FROM __THIS__")

pipeline = Pipeline(stages=[sqlTrans, sqlSelectExpr])
pipelineModel = pipeline.fit(df)
pipelineModel.transform(df).show()
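
For reference, with the sample data above the show() output should look roughly like this (v3 = v1 + v2, v4 = v1 * v2, v5 = id * 2):

+---+---+---+---+----+---+
| id| v1| v2| v3|  v4| v5|
+---+---+---+---+----+---+
|  0|1.0|3.0|4.0| 3.0|  0|
|  2|2.0|5.0|7.0|10.0|  4|
+---+---+---+---+----+---+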

When all of the transformations are simple expressions like the ones above, another way to chain them is to use a single SQLTransformer and some string manipulation:

transforms = ['(v1 + v2) AS v3',
              '(v1 * v2) AS v4',
              '(id * 2) AS v5',
              ]
selectExpr = "SELECT *, {} FROM __THIS__".format(",".join(transforms))
sqlSelectExpr = SQLTransformer(statement=selectExpr)
sqlSelectExpr.transform(df).show()

Keep in mind that Spark SQL transformations can be optimized by Catalyst and will be faster than transformations defined as Python user-defined functions (UDFs).
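
To make that trade-off concrete, here is a sketch computing the same v3 column both ways; the expression form is visible to the Catalyst optimizer and runs entirely in the JVM, while the UDF forces every row through the Python interpreter:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Catalyst-optimizable SQL expression: stays in the JVM
df.withColumn("v3", F.expr("v1 + v2"))

# Equivalent Python UDF: each row is serialized out to a Python worker
add_cols = F.udf(lambda a, b: a + b, DoubleType())
df.withColumn("v3", add_cols("v1", "v2"))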