Compounding in Spark

Posted: 2016-12-09 05:43:41

Tags: apache-spark apache-spark-sql spark-dataframe

I have a dataframe in this format:

Date        |   Return
01/01/2015       0.0
02/02/2015      -0.02
03/02/2015       0.05
04/02/2015       0.07

I want to compound the returns and add a column containing the compounded return. The compounded return is calculated as follows:

  • Row 1: Compounded = 1

  • Row i: Compounded(i) = (1 + Return(i)) * Compounded(i-1)

So my df will finally be:

Date          |   Return  | Compounded
01/01/2015         0.0         1.0
02/02/2015        -0.02        1.0*(1-0.02)=0.98
03/02/2015         0.05        0.98*(1+0.05)=1.029
04/02/2015         0.07        1.029*(1+0.07)=1.10103
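
In other words, the Compounded column is just a running product of (1 + Return). A minimal local Scala sketch of the recurrence (illustration only, not a Spark solution; it uses the sample returns above):

// Row 1 is fixed at 1.0, so the running product starts at 1.0 and the
// remaining returns are folded in with scanLeft.
val returns = Seq(0.0, -0.02, 0.05, 0.07)
val compounded = returns.tail.scanLeft(1.0)((acc, r) => acc * (1 + r))
// compounded: List(1.0, 0.98, 1.029, 1.10103) (approximately)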

An answer in Java would be highly appreciated.

2 answers:

Answer 0 (score: -1)

First, we define a function f(line) (please suggest a better name!!) to process the lines:

# Globals holding the running state: whether the next data line is the
# first one, and the last compounded value computed so far.
firstLine = True
last_compounded = 1.0

def f(line):
    global firstLine
    global last_compounded
    if line[0] == 'Date':
        # Header line: pass it through and add the name of the new column.
        firstLine = True
        return (line[0], line[1], 'Compounded')
    if firstLine:
        # First data line: the compounded return starts at 1.
        last_compounded = 1.0
        firstLine = False
    else:
        # Compounded(i) = (1 + Return(i)) * Compounded(i-1)
        last_compounded = (1 + float(line[1])) * last_compounded
    return (line[0], line[1], last_compounded)

Using two global variables (can this be improved?), we keep track of the Compounded(i-1) value and of whether we are processing the first line.

With your data in some_file, a solution could be:

rdd = sc.textFile('some_file').map(lambda l: l.split())
r1 = rdd.map(lambda l: f(l))

rdd.collect()
[[u'Date', u'Return'], [u'01/01/2015', u'0.0'], [u'02/02/2015', u'-0.02'], [u'03/02/2015', u'0.05'], [u'04/02/2015', u'0.07']]

r1.collect()
[(u'Date', u'Return', 'Compounded'), (u'01/01/2015', u'0.0', 1.0), (u'02/02/2015', u'-0.02', 0.98), (u'03/02/2015', u'0.05', 1.029), (u'04/02/2015', u'0.07', 1.10103)]

Answer 1 (score: -1)

You can also create a custom aggregation function and use it in a window function.

Something like this (written free-form, so there might be some mistakes):

package com.myuadfs

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyUDAF() extends UserDefinedAggregateFunction {

  def inputSchema: StructType = StructType(Array(StructField("Return", DoubleType)))

  def bufferSchema: StructType = StructType(Array(StructField("compounded", DoubleType)))

  def dataType: DataType = DoubleType

  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = 1.0 // set compounded to 1
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    buffer(0) = buffer.getDouble(0) * ( input.getDouble(0) + 1)
  }

  // this generally merges two aggregated buffers. This means this 
  // would not have worked properly had you been working with a regular
  // aggregate, but since you are planning to use this inside a window
  // only, this should not be called at all.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
 }

  def evaluate(buffer: Row) = {
    buffer.getDouble(0)
  }
}

Now you can use it inside a window function, like this:

import org.apache.spark.sql.expressions.Window

val myUDAF = new MyUDAF()
val windowSpec = Window.orderBy("Date")
val newDF = df.withColumn("Compounded", myUDAF(df("Return")).over(windowSpec))

Note that this has the limitation that the entire calculation must fit into a single partition, so you will have a problem if the data is too large. That said, this kind of operation is normally performed after partitioning by some key (e.g. by adding partitionBy to the window), and then the elements of a single key should fit in a partition.
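
For example, a sketch of that partitioned variant, reusing myUDAF from the snippet above and assuming a hypothetical grouping column named "key" (not present in the sample data):

// Each "key" group is compounded independently in Date order, and only a
// single group has to fit in one partition.
val perKeySpec = Window.partitionBy("key").orderBy("Date")
val perKeyDF = df.withColumn("Compounded", myUDAF(df("Return")).over(perKeySpec))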