I have 3 rows in a dataframe, and in 2 of them the id column is null. I need to go through each row of that id column and replace the nulls with an epoch time, which should be unique within the dataframe itself. How can I do this? For example:
id   | name
1    | a
null | b
null | c
I want this dataframe, with the null values converted to epoch times:
id      | name
1       | a
1435232 | b
1542344 | c
Answer 0 (score: -1)
EDIT

You need to make sure the UDF is precise enough: if it only has millisecond resolution, you will see duplicate values. See the example below, which illustrates how this kind of approach works:
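A minimal sketch of one such higher-precision UDF, using System.nanoTime for sub-millisecond resolution; the identifiers below (nanoId, the sample df) are illustrative and not from the original answer:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}

val spark = SparkSession.builder.appName("fill-null-ids").getOrCreate()
import spark.implicits._

// Sample frame matching the question: id is null for two of three rows.
val df = Seq((Some(1L), "a"), (None, "b"), (None, "c")).toDF("id", "name")

// System.nanoTime has sub-millisecond resolution, so consecutive calls
// are far less likely to collide than Instant.now.toEpochMilli.
// The dummy String argument makes Spark evaluate the UDF once per row.
val nanoId = udf((_: String) => System.nanoTime)

df.withColumn("id", when(col("id").isNull, nanoId(col("name"))).otherwise(col("id")))
  .show()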
Answer 1 (score: -1)
Check this out:
scala> val s1: Seq[(Option[Int], String)] = Seq((Some(1), "a"), (None, "b"), (None, "c"))
s1: Seq[(Option[Int], String)] = List((Some(1),a), (None,b), (None,c))
scala> val df = s1.toDF("id","name")
df: org.apache.spark.sql.DataFrame = [id: int, name: string]
scala> val epoch = java.time.Instant.now.getEpochSecond
epoch: Long = 1539084285
scala> df.withColumn("id",when( $"id".isNull,epoch).otherwise($"id")).show
+----------+----+
| id|name|
+----------+----+
| 1| a|
|1539084285| b|
|1539084285| c|
+----------+----+
scala>
EDIT1:
Even if I use milliseconds, I still get the same values. Spark does not capture nanoseconds in the time part, and many rows can land on the same millisecond, so the assumption that the epoch alone yields unique values does not hold.
scala> def getEpoch(x:String):Long = java.time.Instant.now.toEpochMilli
getEpoch: (x: String)Long
scala> val myudfepoch = udf( getEpoch(_:String):Long )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539087300957| b|
|1539087300957| c|
+-------------+----+
scala>
The only remaining option is to use monotonicallyIncreasingId, but then the values may not always have the same length.
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)+monotonicallyIncreasingId).otherwise($"id")).show
warning: there was one deprecation warning; re-run with -deprecation for details
+-------------+----+
| id|name|
+-------------+----+
| 1| a|
|1539090186541| b|
|1539090186543| c|
+-------------+----+
scala>
EDIT2:
I can trick System.nanoTime into giving increasing ids. They are not sequential, but the length stays constant. See below:
scala> def getEpoch(x:String):String = System.nanoTime.toString.take(12)
getEpoch: (x: String)String
scala> val myudfepoch = udf( getEpoch(_:String):String )
myudfepoch: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.withColumn("id",when( $"id".isNull,myudfepoch('name)).otherwise($"id")).show
+------------+----+
| id|name|
+------------+----+
| 1| a|
|186127230392| b|
|186127230399| c|
+------------+----+
scala>
Try this when running on a cluster, and adjust take(12) if you get duplicate values.
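If per-row clock reads still collide on a cluster, one alternative sketch is to lean entirely on monotonically_increasing_id (the non-deprecated spelling of the function used in EDIT1's follow-up), which embeds the partition id in its upper bits and is therefore unique across executors. This assumes the df from the transcript above; the name epochBase is illustrative:

import org.apache.spark.sql.functions.{col, lit, monotonically_increasing_id, when}

// monotonically_increasing_id() is unique across the whole DataFrame:
// the partition id sits in the upper 31 bits, so no two executors can
// produce the same value. Adding it to a single epoch snapshot keeps
// the "looks like an epoch" shape without per-row clock reads.
val epochBase = java.time.Instant.now.toEpochMilli

df.withColumn(
  "id",
  when(col("id").isNull, lit(epochBase) + monotonically_increasing_id())
    .otherwise(col("id"))
).show()

Unlike the clock-based variants, uniqueness here is guaranteed, at the cost of the fixed-length property noted earlier.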