Question

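I am trying to read a file and add two extra columns: 1. a sequence number and 2. the filename. When I run the Spark job from the Scala IDE, the output is generated correctly, but when I run it from PuTTY in local or cluster mode, the job gets stuck at stage 2 (save at File_Process). There is no progress even after waiting an hour. I am testing with 1 GB of data.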

Below is the code I am using:

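import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Note: SEED and filename are not defined in this snippet; they are assumed to be defined elsewhere.
object File_Process
{
 Logger.getLogger("org").setLevel(Level.ERROR)
 val spark = SparkSession
             .builder()
             .master("yarn")
             .appName("File_Process")
             .getOrCreate()
 def main(arg: Array[String]): Unit =
 {
  // Read every CSV file under the source directory
  val FileDF = spark.read
               .csv("/data/sourcefile/")
  // Prepend a sequence number (row index + SEED + 1) to each row
  val rdd = FileDF.rdd.zipWithIndex().map(indexedRow => Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
  // Schema: the new identifier column followed by the original columns
  val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
  val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
  // Add the filename as a constant column
  val dataframefinal = datasetnew.withColumn("Filetag", lit(filename))
  // Write the result as pipe-delimited text
  val query = dataframefinal.write
              .mode("overwrite")
              .format("com.databricks.spark.csv")
              .option("delimiter", "|")
              .save("/data/text_file/")
  spark.stop()
 }
}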

If I remove the logic that adds the sequence number, the code works fine. The code that creates the sequence number is:

val rdd = FileDF.rdd.zipWithIndex().map(indexedRow =>Row.fromSeq((indexedRow._2.toLong+SEED+1)+:indexedRow._1.toSeq))
val FileDFWithSeqNo = StructType(Array(StructField("UniqueRowIdentifier",LongType)).++(FileDF.schema.fields))
val datasetnew = spark.createDataFrame(rdd,FileDFWithSeqNo)
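
For reference, the sequence-number step can be isolated into a small standalone sketch that runs against a local master on a tiny in-memory DataFrame. This is only an illustration of the same zipWithIndex technique, not the actual job: the names SeqNoSketch and addSeqNo, the sample rows, and the seed value 100 are all assumptions made up for this example.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper object; not part of the job shown above.
object SeqNoSketch {

  // Prepends a sequence number (row index + seed + 1) to every row of df.
  def addSeqNo(spark: SparkSession, df: DataFrame, seed: Long): DataFrame = {
    val indexed = df.rdd.zipWithIndex().map { case (row, idx) =>
      Row.fromSeq((idx + seed + 1) +: row.toSeq)
    }
    // New identifier column first, then the original columns.
    val schema = StructType(
      StructField("UniqueRowIdentifier", LongType, nullable = false) +: df.schema.fields
    )
    spark.createDataFrame(indexed, schema)
  }

  def main(args: Array[String]): Unit = {
    // Local master so the sketch runs without a cluster (assumption for this example).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SeqNoSketch")
      .getOrCreate()
    import spark.implicits._

    // Tiny stand-in for the CSV input; "_c0" mimics Spark's default CSV column name.
    val sample = Seq("a", "b", "c").toDF("_c0")
    addSeqNo(spark, sample, seed = 100L).show()

    spark.stop()
  }
}
```

In this sketch the SparkSession is created inside main and the seed is passed in as a parameter, so the closure given to map only captures a local Long value.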

Thanks in advance.

Spark DataFrame write to file using Scala

0 answers: