Question

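I am trying to achieve the following:

I have received time-series data with three columns, "Timestamp", "Lag1_Timestamp", and "MyData", where Lag1_Timestamp is the timestamp lagged by 1.
I first have to create another column, time_diff = Timestamp - Lag1_Timestamp, with the constraint that whenever MyData = 0, time_diff = 0.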

val df3 = df2.withColumn("time_diff", when(df2("Timestamp") === 0, 0).otherwise(when(df2("MyData") === 0, 0).otherwise(df2("Timestamp") - df2("Lag1_Timestamp"))))
Once time_diff has been computed, I need to calculate a cumulative sum as follows:

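a. Start the cumulative sum at 0, so that at Timestamp 0, cum_sum = 0.

b. Then keep computing the cumulative sum for each record (assuming the DataFrame is sorted by Timestamp).

c. However, whenever a value of time_diff = 0 is encountered, reset the cumulative sum to 0 and start accumulating again from that point.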
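
To make rules a to c concrete, here is a minimal plain-Scala sketch of the intended semantics on an in-memory sequence (no Spark involved). The helper name resettingCumSum and the sample values are made up for illustration, and it assumes the time_diff values are already ordered by Timestamp.

```scala
// Hypothetical helper (not part of my actual job): a running sum over
// time_diff values that restarts at 0 whenever a time_diff of 0 is seen.
def resettingCumSum(timeDiffs: Seq[Double]): Seq[Double] =
  timeDiffs
    .scanLeft(0.0)((acc, d) => if (d == 0.0) 0.0 else acc + d)
    .tail // drop the seed value that scanLeft puts in front

// Made-up sample: the first record and the fourth record have time_diff = 0.
val sample = Seq(0.0, 2.0, 3.0, 0.0, 1.0, 4.0)
println(resettingCumSum(sample)) // List(0.0, 2.0, 5.0, 0.0, 1.0, 5.0)
```

The sum restarts at the fourth element because its time_diff is 0, which is the behaviour I want on the full DataFrame.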

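// Pull every row to the driver
val list = df3.collect()
// One slot per row; Array.ofDim[Double] initialises every entry to 0.0
val cumSum = Array.ofDim[Double](list.length)
for ((cur, i) <- list.view.zipWithIndex) {
  if (i != 0) {
    val prev = list(i - 1)
    // Positional access: with columns (Timestamp, Lag1_Timestamp, MyData, time_diff),
    // index 1 is Lag1_Timestamp and index 3 is time_diff
    if (prev(1) != 0 && cur(1) != 0) {
      cumSum(i) = cumSum(i - 1) + cur(3).asInstanceOf[Double] + prev(3).asInstanceOf[Double]
    }
    // otherwise cumSum(i) keeps its initial value of 0.0
  }
}
// Turn the driver-side array back into a single-column DataFrame
val summing = sc.parallelize(cumSum).toDF("Uptime")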
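// Imports needed by the helper and the join below
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.asc
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)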
// Add indices
val aWithIndex = addIndex(inDF)
val bWithIndex = addIndex(summing)

// Join and clean
val ab1 = aWithIndex
  .join(bWithIndex, Seq("_index")).orderBy(asc("Timestamp"))
  .drop("_index")

Although the code works, it is slow. My question is whether there is a better way to achieve the same goal.

Thanks and regards

About manipulating DataFrames

0 Answers: