Question

The PySpark documentation shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.

Answer 1

It really depends on your goals.

Technically, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to.

from pyspark.sql import DataFrame

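class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Reuse the JVM DataFrame handle and SQL context of an existing DataFrame.
        # The class is named explicitly because super(self.__class__, self)
        # recurses forever once this class is itself subclassed.
        super(DataFrameWithZipWithIndex, self).__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Pair each row with its index, then prepend the index as column "_idx".
        return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))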

Example usage:

df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])

with_zipwithindex = DataFrameWithZipWithIndex(df)

isinstance(with_zipwithindex, DataFrame)

True

with_zipwithindex.zipWithIndex().show()

+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+

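Practically speaking, you won't be able to do much here. DataFrame is a thin wrapper around a JVM object. Beyond providing docstrings, converting arguments to the form the JVM side requires, calling JVM methods, and wrapping the results in Python adapters where necessary, it doesn't do much.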

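With plain Python code you won't be able to get anywhere near the DataFrame / Dataset internals or modify their core behavior. If you're looking for a standalone, Python-only Spark DataFrame implementation, that is not possible.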
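
To make the thin-wrapper point concrete, here is a toy sketch in pure Python. Nothing in it is Spark code: FakeJvmFrame, ThinFrame, and ThinFrameWithHead are hypothetical stand-ins, with FakeJvmFrame playing the role of the JVM object and ThinFrame the role of the Python-side class that only delegates to it.

```python
class FakeJvmFrame(object):
    """Hypothetical stand-in for the JVM-side object; not part of Spark."""

    def __init__(self, rows):
        self._rows = list(rows)

    def limit(self, n):
        return FakeJvmFrame(self._rows[:n])

    def count(self):
        return len(self._rows)


class ThinFrame(object):
    """Hypothetical Python-side wrapper: all real work happens in the backend."""

    def __init__(self, backend):
        self._backend = backend

    def limit(self, n):
        # Convert the argument, call the backend, wrap the result again.
        return ThinFrame(self._backend.limit(int(n)))

    def count(self):
        return self._backend.count()


class ThinFrameWithHead(ThinFrame):
    """A subclass can only combine calls that the wrapper already delegates."""

    def head_count(self, n):
        return self.limit(n).count()


frame = ThinFrameWithHead(FakeJvmFrame([("a", 1), ("b", 2), ("c", 3)]))
print(frame.head_count(2))                               # 2
print(isinstance(frame.limit(1), ThinFrameWithHead))     # False
```

The second print shows a practical consequence of this design: inherited methods build the base wrapper class, so the result of limit is no longer an instance of the subclass. A subclassed PySpark DataFrame is likely to behave the same way, so expect to re-wrap results after each transformation.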
