PySpark DataFrames: aggregate functions with SciPy

Time: 2015-05-19 18:46:28

Tags: apache-spark dataframe pyspark

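I've tried a few different approaches to using Spark 1.3 DataFrames with functions such as SciPy's kurtosis or NumPy's std. Here is the sample code, but it just hangs, even on a 10x10 dataset (10 rows, 10 columns). I've tried: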

print df.groupBy().agg(kurtosis(df.offer_id)).collect()

print df.agg(kurtosis(df.offer_ID)).collect()

But this works without any problem:

print df.agg(F.min(df.offer_id), F.min(df.decision_id)).collect()

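My guess is that this works because F (from pyspark.sql import functions as F) provides programmatic SQL functions, whereas kurtosis does not come from that module. How can I use DataFrames to compute something like kurtosis on a dataset?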

This also just hangs:

print df.map(kurtosis(df.offer_id)).collect()

1 answer:

Answer 0: (score: 2)

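Sadly, Spark SQL's current support for Python UDFs is a bit lacking. I've been trying to add some UDFs in Scala and make them callable from Python for a project I'm working on, so I put together a quick proof of concept that implements kurtosis as a UDAF. The branch currently lives at https://github.com/holdenk/sparklingpandas/tree/add-kurtosis-support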

The first step is to define our UDAF in Scala. This is probably less than ideal, but here is an implementation:

object functions {
  def kurtosis(e: Column): Column = new Column(Kurtosis(EvilSqlTools.getExpr(e)))
}

case class Kurtosis(child: Expression) extends AggregateExpression {
  def this() = this(null)

  override def children = child :: Nil
  override def nullable: Boolean = true
  override def dataType: DataType = DoubleType
  override def toString: String = s"Kurtosis($child)"
  override def newInstance() =  new KurtosisFunction(child, this)
}

case class KurtosisFunction(child: Expression, base: AggregateExpression) extends AggregateFunction {
  def this() = this(null, null)

  var data = scala.collection.mutable.ArrayBuffer.empty[Any]
  override def update(input: Row): Unit = {
    data += child.eval(input)
  }

  // This function seems shaaady
  // TODO: Do something more reasonable
  private def toDouble(x: Any): Double = {
    x match {
      case x: NumericType => EvilSqlTools.toDouble(x.asInstanceOf[NumericType])
      case x: Long => x.toDouble
      case x: Int => x.toDouble
      case x: Double => x
    }
  }
  override def eval(input: Row): Any = {
    if (data.isEmpty) {
      println("No data???")
      null
    } else {
      val inputAsDoubles = data.toList.map(toDouble)
      println("computing on input "+inputAsDoubles)
      val inputArray = inputAsDoubles.toArray
      val apacheKurtosis = new ApacheKurtosis()
      val result = apacheKurtosis.evaluate(inputArray, 0, inputArray.size)
      println("result "+result)
      Cast(Literal(result), DoubleType).eval(null)
    }
  }
}
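For readers more comfortable with Python, here is a minimal sketch of what the buffering aggregate above appears to do: collect every value in update, then compute the statistic once in eval. It assumes ApacheKurtosis refers to the Kurtosis class from Apache Commons Math (the answer does not show its imports), which computes the bias-corrected sample excess kurtosis and yields NaN for fewer than four values. The name KurtosisBuffer is hypothetical and is not part of the linked branch.

```python
import math


class KurtosisBuffer(object):
    """Hypothetical pure-Python stand-in for the Scala KurtosisFunction."""

    def __init__(self):
        # Like the Scala version, keep every value seen so far in memory.
        self.data = []

    def update(self, value):
        self.data.append(float(value))

    def eval(self):
        n = len(self.data)
        if n == 0:
            return None          # mirrors the "No data???" branch
        if n < 4:
            return float("nan")  # assumed Commons Math behaviour for n <= 3
        mean = sum(self.data) / n
        # Bias-corrected (n - 1) sample variance.
        variance = sum((x - mean) ** 2 for x in self.data) / (n - 1)
        if variance == 0.0:
            return float("nan")  # the ratio below would be 0/0
        fourth = sum((x - mean) ** 4 for x in self.data) / variance ** 2
        coefficient = (n * (n + 1.0)) / ((n - 1.0) * (n - 2.0) * (n - 3.0))
        correction = (3.0 * (n - 1.0) ** 2) / ((n - 2.0) * (n - 3.0))
        return coefficient * fourth - correction


demo = KurtosisBuffer()
for demo_value in [1, 2, 3, 4, 5]:
    demo.update(demo_value)
print(demo.eval())  # approximately -1.2
```

Because the buffer holds the whole group in memory, this approach only suits groups that fit on a single executor, which is consistent with the answer calling it a proof of concept.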

We can then use logic similar to that used in Spark SQL's functions.py implementation:

"""Our magic extend functions. Here lies dragons and a sleepy holden."""
from py4j.java_collections import ListConverter

from pyspark import SparkContext
from pyspark.sql.dataframe import Column, _to_java_column

__all__ = []
def _create_function(name, doc=""):
    """ Create a function for aggregator by name"""
    def _(col):
        sc = SparkContext._active_spark_context
        jc = getattr(sc._jvm.com.sparklingpandas.functions, name)(col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}


for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)
del _name, _doc
__all__ += _functions.keys()
__all__.sort()

Then we can go ahead and call it as a UDAF, like so:

from sparklingpandas.custom_functions import *
import random
from pyspark.sql import Row  # Row is used below when building the DataFrame
input = range(1,6) + range(1,6) + range(1,6) + range(1,6) + range(1,6) + range(1,6)
df1 = sqlContext.createDataFrame(sc.parallelize(input)\
                                    .map(lambda i: Row(single=i, rand= random.randint(0,100000))))
df1.collect()
import pyspark.sql.functions as F
x = df1.groupBy(df1.single).agg(F.min(df1.rand))
x.collect()
j = df1.groupBy(df1.single).agg(kurtosis(df1.rand))
j.collect()