How do I add quotation marks at the beginning and end of an RDD?

Time: 2018-01-27 03:19:53

Tags: scala apache-spark

I want to add quotation marks to the contents of an RDD. I have a text file that is loaded into an RDD. This is the Scala code:

val file: String = "file:///data/book/starwars"
val bookStarWarsRDD = sc.textFile(file);

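The text file is really just the plain text of a Star Wars story, for example Return of the Jedi. I want to put a quotation mark before the first word and after the last word. The beginning should look like: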

"A long time ago in a galaxy far, far away...

Then a closing quotation mark should follow the last word (or the last part of the story), like:

...and the saga continues. The end."

How can I do this with an RDD?

2 answers:

Answer 0: (score: 2)

You should use wholeTextFiles for this, because wholeTextFiles reads each file as a Tuple2(filename, whole_texts). You can then add " at the beginning and at the end of whole_texts:

val file : String = "file:///data/book/starwars"
val bookStarWarsRDD = sc.wholeTextFiles(file).map(kv => "\""+kv._2+"\"").flatMap(_.split("\n"));
bookStarWarsRDD.foreach(println)

This should give you the output you want.
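
One detail worth checking with this approach: if the file ends with a newline, the closing quote is appended after that newline and ends up on a line of its own once the text is split. The following is a minimal sketch of one way to handle that, not part of the original answer. The helper name quoteWholeText is hypothetical, and the sample string stands in for the file contents so the snippet runs without Spark.

```scala
// Hypothetical helper (an assumption, not from the original answer):
// wraps the whole text in double quotes, then splits it into lines.
def quoteWholeText(text: String): Seq[String] = {
  // Drop trailing whitespace/newlines so the closing quote stays on the last line of text
  val trimmed = text.replaceAll("\\s+$", "")
  ("\"" + trimmed + "\"").split("\n").toSeq
}

// Local check with an in-memory string that ends in a newline
val sample = "A long time ago in a galaxy far, far away...\n...and the saga continues. The end.\n"
val quoted = quoteWholeText(sample)
quoted.foreach(println)

// With Spark, assuming `sc` and `file` are defined as in the answer above:
// val bookStarWarsRDD = sc.wholeTextFiles(file).flatMap(kv => quoteWholeText(kv._2))
```

Note that wholeTextFiles produces one record per file, so if the path points at a directory with several files, each file would get its own pair of quotes.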

Answer 1: (score: 1)

For an RDD, use:

  import spark.implicits._  // needed for toDF; assumes a SparkSession named spark (spark-shell imports this already)

  val myDF = Seq("Sentence1 something. Sentence2 something").toDF("text")

  // You may have to adjust the index of the text column: replace x(0) with x(<your column index>)
  val test = myDF.rdd.map { x => (x(0), "\"" + x(0) + "\"") }
  test.foreach(println)

This prints:

(Sentence1 something. Sentence2 something,"Sentence1 something. Sentence2 something")
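
The snippet above quotes every row. If the RDD comes from sc.textFile and only the first and last lines should be quoted, as the question describes, one possible approach is to index the lines and treat the two ends separately. The following is a rough sketch under that assumption; the helper quoteEnds and the sample lines are hypothetical, and a plain Seq is used so the logic can be run without Spark.

```scala
// Hypothetical helper (an assumption, not from the original answers):
// adds an opening quote to the first line and a closing quote to the last line.
def quoteEnds(line: String, idx: Long, lastIdx: Long): String = {
  val open = if (idx == 0L) "\"" else ""
  val close = if (idx == lastIdx) "\"" else ""
  open + line + close
}

// Made-up sample lines standing in for the contents of the text file
val lines = Seq(
  "A long time ago in a galaxy far, far away...",
  "(middle of the story)",
  "...and the saga continues. The end."
)

val lastIdx = lines.length - 1L
val quotedLines = lines.zipWithIndex.map { case (line, i) => quoteEnds(line, i.toLong, lastIdx) }
quotedLines.foreach(println)

// Spark equivalent, assuming bookStarWarsRDD was created with sc.textFile as in the question:
// val lastLineIdx = bookStarWarsRDD.count() - 1
// val quotedRDD = bookStarWarsRDD.zipWithIndex().map { case (line, i) => quoteEnds(line, i, lastLineIdx) }
```

This keeps the one-record-per-line layout of textFile at the cost of an extra count() pass. It relies on the index order matching the line order of the file, which is worth verifying if the path covers more than one file.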

If you can use a DataFrame:

import org.apache.spark.sql.functions.{col, concat, lit}

val myDF = Seq("Sentence1 something. Sentence2 something").toDF("text")

// Wrap the text column in literal double quotes
val withQuotes = myDF.withColumn("textWithQuotes", concat(lit("\""), col("text"), lit("\"")))
withQuotes.show(false)

This prints:

+----------------------------------------+------------------------------------------+
|text                                    |textWithQuotes                            |
+----------------------------------------+------------------------------------------+
|Sentence1 something. Sentence2 something|"Sentence1 something. Sentence2 something"|
+----------------------------------------+------------------------------------------+