Replace every null value in a dataframe with a unique time

Date: 2018-10-09 09:12:24

Tags: scala apache-spark dataframe

I have 3 rows in a dataframe, and in 2 of them the id column is null. I need to go through each row of that particular id column and replace the nulls with an epoch time, which should be unique for each row, and the replacement should happen within the dataframe itself. How can this be done? For example:

id   | name
1    | a
null | b
null | c

I want this dataframe with the null values converted to epoch times:

id      | name
1       | a
1435232 | b
1542344 | c

2 Answers:

Answer 0 (score: -1)


EDIT

You need to make sure the UDF is precise enough. If it only has millisecond resolution, you will see duplicate values. See the example below, which demonstrates how the approach works:

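As a sketch of that idea, assuming the df with the id and name columns from the question: a UDF built on System.nanoTime has finer-than-millisecond resolution (note that nanoTime is a per-JVM monotonic counter rather than wall-clock epoch time, and preciseEpoch is an illustrative name, not code from the original answer):

import org.apache.spark.sql.functions.{col, udf, when}

// Hypothetical UDF: System.nanoTime advances between row evaluations,
// so consecutive rows are far less likely to collide than with
// millisecond-resolution timestamps.
val preciseEpoch = udf((name: String) => System.nanoTime())

df.withColumn("id",
  when(col("id").isNull, preciseEpoch(col("name")))
    .otherwise(col("id"))).show()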

Answer 1 (score: -1)

Check this out:

scala>  val s1:Seq[(Option[Int],String)] = Seq( (Some(1),"a"), (null,"b"), (null,"c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (null,b), (null,c))

scala> val df = s1.toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285

scala> df.withColumn("id",when( $"id".isNull,epoch).otherwise($"id")).show
+----------+----+
|        id|name|
+----------+----+
|         1|   a|
|1539084285|   b|
|1539084285|   c|
+----------+----+


scala>

EDIT1:

I tried milliseconds and still got the same values. Spark does not capture the time at nanosecond resolution, and many rows can land on the same millisecond, so your assumption of getting unique values from the epoch will not hold.

scala> def getEpoch(x:String):Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long

scala> val myudfepoch = udf( getEpoch(_:String):Long )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539087300957|   b|
|1539087300957|   c|
+-------------+----+


scala>

The only possibility is to use monotonicallyIncreasingId, but then the values may not always have the same length.

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)+monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
|           id|name|
+-------------+----+
|            1|   a|
|1539090186541|   b|
|1539090186543|   c|
+-------------+----+


scala>
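As an aside, the camelCase monotonicallyIncreasingId that triggers the deprecation warning above has a snake_case replacement in Spark 2.0+. A sketch reusing the df and myudfepoch defined in the transcript:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Same approach, without the deprecation warning (Spark 2.0+).
df.withColumn("id",
  when($"id".isNull, myudfepoch('name) + monotonically_increasing_id())
    .otherwise($"id")).show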

EDIT2:

I can trick System.nanoTime into giving increasing IDs. They are not sequential, but the length stays constant. See below:

scala> def getEpoch(x:String):String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String

scala>  val myudfepoch = udf( getEpoch(_:String):String )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))

scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+------------+----+
|          id|name|
+------------+----+
|           1|   a|
|186127230392|   b|
|186127230399|   c|
+------------+----+


scala>

Try this when running on a cluster, and adjust the take(12) if you get duplicate values.
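A quick way to confirm that a given run produced no collisions, reusing the df and myudfepoch from the transcript (the result name is illustrative):

val result = df.withColumn("id",
  when($"id".isNull, myudfepoch('name)).otherwise($"id"))

// Any row in this output means duplicate ids were generated and
// take(12) (or the approach itself) needs adjusting.
result.groupBy("id").count().filter($"count" > 1).show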