Question

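*Hi all,

I have a quick question for all of you. I have an RDD created from a Kafka stream using the createStream method. I now want to add a timestamp as a value to this RDD before converting it to a DataFrame. I tried adding the value to the DataFrame with withColumn(), but it returns this error:*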

val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()

val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)

messages.foreachRDD(rdd => {
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._

  val dataframe = sqlContext.read.json(rdd.map(_._2))

  // This is the line that throws the AnalysisException shown below
  val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
})

org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name, item_name, lat, lon, memberid, productUpccd, tenantid);
    at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15

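I have learned that DataFrames cannot be altered because they are immutable, but RDDs are immutable as well. So what is the best way to do this? How can I add a value to the RDD (that is, add a timestamp to the RDD dynamically)?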

Answer 1

Try the current_timestamp function.

import org.apache.spark.sql.functions._

// current_timestamp() already returns a Column, so wrapping it in lit() is not needed
df.withColumn("time_stamp", current_timestamp())

Answer 2

This works for me. I usually perform a write right after this.

import org.apache.spark.sql.functions.current_timestamp

val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())

Answer 3

To add a new column holding a constant, such as a timestamp, you can use the lit function from org.apache.spark.sql.functions.
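
A minimal sketch of adding a constant column with lit, not the answerer's original code. It assumes Spark 2.x or later (SparkSession in place of the SQLContext used in the question); withColumn and lit behave the same way on both. The object name, the withLoadTime helper, and the sample rows are hypothetical, chosen only for illustration. The timestamp is captured once on the driver, so every row gets the same value.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object ConstantTimestampSketch {
  // Hypothetical helper: attaches one driver-side timestamp to every row
  // as a literal (constant) column named "timeStamp_column".
  def withLoadTime(df: DataFrame, loadedAt: Timestamp): DataFrame =
    df.withColumn("timeStamp_column", lit(loadedAt))

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; a real job would reuse its existing session.
    val spark = SparkSession.builder()
      .appName("constant-timestamp-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the parsed Kafka JSON.
    val oldDF = Seq(("click", "phone"), ("view", "tablet")).toDF("action", "device_type")

    // Captured once on the driver, then embedded in the plan as a constant.
    val now = new Timestamp(System.currentTimeMillis())
    val newDF = withLoadTime(oldDF, now)

    newDF.printSchema()
    newDF.show(truncate = false)

    spark.stop()
  }
}
```

The question's attempt failed because dataframe.col("now") looks up an existing column called "now", while lit turns a local Scala value into a Column.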

Answer 4

In Scala / Databricks:

import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())

See my output

How to add a timestamp as an extra column to my DataFrame

4 answers: