Add a new column to a DataFrame containing the number of neighbors of another column's values

Date: 2017-10-30 17:07:00

Tags: scala apache-spark dataframe spark-dataframe

I have a DataFrame like this:

org.apache.spark.sql.DataFrame = [Timestamp: int, AccX: double ... 17 more fields]

The timestamps are not contiguous, and they are in epoch format.

I want to add a new column containing, for each row, the number of timestamps that are close to that row's timestamp.

Example:

TimeStamp

1
5
6
12
13
16

Imagine our range is 3. The output would be:

|      TimeStamp      |    New column    |
|          1          |         1        |
|          5          |         2        |
|          6          |         2        |
|          12         |         2        |
|          13         |         3        |
|          16         |         2        |
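
To make the rule explicit: a row's own timestamp is included in its count. Here is a minimal plain-Scala sketch of the counting rule I have in mind (the names are just for illustration):

val ts = Seq(1, 5, 6, 12, 13, 16)
val range = 3

// For each timestamp, count how many timestamps (itself included)
// lie within `range` of it.
val counts = ts.map(t => ts.count(u => math.abs(t - u) <= range))
// counts == List(1, 2, 2, 2, 3, 2)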

I was thinking of doing something like this:

MyDF.map { x =>
  MyDF.filter(
    MyDF("Timestamp").gt(x.getAs[Int]("Timestamp") - range) &&
    MyDF("Timestamp").lt(x.getAs[Int]("Timestamp") + range)
  ).count()
}

But this leaves me with: org.apache.spark.sql.Dataset[Long] = [value: bigint]

which I don't know what to do with.

Does anyone have a better idea of how to approach this?

Thanks

UPDATE: I am using a Zeppelin notebook running Spark version 2.1.1. After trying the solution proposed by @Dennis Tsoi, I get an error when I try to perform an action on the resulting DataFrame, such as show or collect.

Here is the full text of the error:

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2104)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)
  at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:371)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2386)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withNewExecutionId(Dataset.scala:2788)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2385)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2392)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2128)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2127)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2818)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2127)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2342)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:638)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:597)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:606)
  ... 88 elided
Caused by: java.io.NotSerializableException: org.apache.spark.sql.expressions.WindowSpec
Serialization stack:
    - object not serializable (class: org.apache.spark.sql.expressions.WindowSpec, value: org.apache.spark.sql.expressions.WindowSpec@79df42d)
    - field (class: $iw, name: windowSpec, type: class org.apache.spark.sql.expressions.WindowSpec)
    - object (class $iw, $iw@20ade815)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@77cac38a)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ebfd642)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ee19937)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@67b1d8f0)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@16ca3d83)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3129d731)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@142a2936)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@494facc5)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@45e32c0a)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@509c3eb6)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7bba53a2)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@20971db8)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@ba81c26)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@9375cbb)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3226a593)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@201516a3)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1ac15b76)
    - field (class: $line20176553781522.$read, name: $iw, type: class $iw)
    - object (class $line20176553781522.$read, $line20176553781522.$read@21cc8115)
    - field (class: $iw, name: $line20176553781522$read, type: class $line20176553781522.$read)
    - object (class $iw, $iw@57677eee)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@1d619339)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@63f875)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2a8641fe)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@279b1062)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2a06eb02)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6071a045)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@36b8b963)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@49987884)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6cdfa5ad)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@3bea2150)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@7d1c7dc)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@78f47403)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@6327d388)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@5d120092)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@4da8dd9c)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@2afee9a4)
    - field (class: $line20176553782370.$read, name: $iw, type: class $iw)
    - object (class $line20176553782370.$read, $line20176553782370.$read@7112605e)
    - field (class: $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, name: $line20176553782370$read, type: class $line20176553782370.$read)
    - object (class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw@cc82e3c)
    - field (class: $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, name: $outer, type: class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw)
    - object (class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw, $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw@9ec8a4e)
    - field (class: $$$$7f619eaa173efe86d354fc4efb19aab8$$$$$anonfun$1, name: $outer, type: class $$$$24338a4fbcb24dc6d683541cf6403767$$$$iw)
    - object (class $$$$7f619eaa173efe86d354fc4efb19aab8$$$$$anonfun$1, <function1>)
    - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, name: func$2, type: interface scala.Function1)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2, <function1>)
    - field (class: org.apache.spark.sql.catalyst.expressions.ScalaUDF, name: f, type: interface scala.Function1)
    - object (class org.apache.spark.sql.catalyst.expressions.ScalaUDF, UDF(input[0, int, true]))
    - element of array (index: 0)
    - array (class [Ljava.lang.Object;, size 1)
    - field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)
    - object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)
  at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
  at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
  at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
  ... 121 more

1 Answer:

Answer 0 (score: 1)

UPDATE

Per-row lookup operations like the gt/lt filter from the question can be very expensive, so I came up with the following solution.

import org.apache.spark.sql.functions.{array, lit, udf}
import spark.implicits._

val timestampsDF =
    Seq(
        (1,  "smth1"),
        (5,  "smth2"),
        (6,  "smth3"),
        (12, "smth4"),
        (13, "smth5"),
        (16, "smth6")
    )
    .toDF("TimeStamp", "smth")

// Collect every timestamp to the driver so it can be embedded
// into each row as a static array column.
val timestampsStatic =
    timestampsDF
    .select("TimeStamp")
    .as[Int]
    .collect()

// Counts how many of the given timestamps lie within 3 of the
// current row's timestamp (the row counts itself).
def countNeighbors = udf((currentTs: Int, timestamps: Seq[Int]) => {
    timestamps.count(ts => Math.abs(currentTs - ts) <= 3)
})

// lit does not accept an array on Spark 2.1 (typedLit only arrived
// in 2.2), so the array column is built element by element.
val alltimeDF =
    timestampsDF
    .withColumn(
        "All TimeStamps",
        array(timestampsStatic.map(lit(_)): _*)
    )

val neighborsDF =
    alltimeDF
    .withColumn(
        "New Column",
        countNeighbors(alltimeDF("TimeStamp"), alltimeDF("All TimeStamps"))
    )
    .drop("All TimeStamps")

neighborsDF.show()

Result:

+---------+-----+----------+
|TimeStamp| smth|New Column|
+---------+-----+----------+
|        1|smth1|         1|
|        5|smth2|         2|
|        6|smth3|         2|
|       12|smth4|         2|
|       13|smth5|         3|
|       16|smth6|         2|
+---------+-----+----------+

Memory consumption concerns

Since a DataFrame can only be accessed on the driver node, you have to copy all the timestamps from the original DF into another column as a static value. This increases memory consumption, but you cannot access all of a column's values from inside a UDF, only the values of the corresponding row. Anyway, I think this is the true "Spark way".
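
If embedding the full timestamp list in every row ever becomes too memory-hungry, a range self-join is one possible alternative. The following is only a sketch, reusing the range of 3 and the timestampsDF defined above (selfJoinCountsDF is just an illustrative name):

import org.apache.spark.sql.functions.{abs, col, count}

// Pair each row with every timestamp within 3 of it (itself included),
// then count the matches per row. This keeps the data distributed,
// at the cost of a join.
val selfJoinCountsDF =
    timestampsDF.as("a")
    .join(
        timestampsDF.select(col("TimeStamp").as("other")),
        abs(col("a.TimeStamp") - col("other")) <= 3
    )
    .groupBy(col("a.TimeStamp"), col("a.smth"))
    .agg(count("*").as("New Column"))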