I would like to add quotation marks to the data in an RDD. I have a text file that is loaded into an RDD. Here is the Scala code:
val file: String = "file:///data/book/starwars"
val bookStarWarsRDD = sc.textFile(file);
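The text file is just the plain text of a Star Wars story, for example Return of the Jedi. I would like to put a quotation mark before the first word, like this: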
"A long time ago in a galaxy far, far away...
and then close the quotation after the last word (or the last part of the story), like this:
...and the saga continues. The end."
How can I do this with an RDD?
Answer 0 (score: 2)
You should use wholeTextFiles for this, because wholeTextFiles reads each file as a Tuple2(filename, whole_texts). You can then add a " at the beginning and at the end of whole_texts.
val file : String = "file:///data/book/starwars"
val bookStarWarsRDD = sc.wholeTextFiles(file).map(kv => "\""+kv._2+"\"").flatMap(_.split("\n"));
bookStarWarsRDD.foreach(println)
This should give you the output you want.
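One caveat: if the file ends with a line break, the closing quote in the snippet above ends up on a line of its own after the split. A minimal sketch of one way around that, assuming it is acceptable to drop trailing line breaks before quoting; quoteWholeText is a hypothetical helper name, not part of Spark:

```scala
object QuoteWholeText {
  // Hypothetical helper: wraps the whole text in double quotes and
  // returns it line by line. Trailing line breaks are removed first
  // so that the closing quote stays attached to the last line.
  def quoteWholeText(text: String): List[String] = {
    val body = text.replaceAll("[\\r\\n]+\\z", "")
    ("\"" + body + "\"").split("\\r?\\n").toList
  }

  def main(args: Array[String]): Unit = {
    // Sample text standing in for the contents of one file.
    val sample =
      "A long time ago in a galaxy far, far away...\n" +
      "...and the saga continues. The end.\n"
    quoteWholeText(sample).foreach(println)
  }
}
```

With Spark, the same helper could be applied as sc.wholeTextFiles(file).flatMap(kv => QuoteWholeText.quoteWholeText(kv._2)). Keep in mind that wholeTextFiles loads each file as a single record, so this suits files that fit comfortably in memory.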
Answer 1 (score: 1)
For an RDD, use:
val myDF = Seq(("Sentence1 something. Sentence2 something")).toDF("text")
// You may have to adjust the index of the text column by replacing x(0) with x(index) in your case
val test = myDF.rdd.map{ case (x) => (x(0) , "\"" + x(0) + "\"") }
test.foreach(println)
This prints:
(Sentence1 something. Sentence2 something,"Sentence1 something. Sentence2 something")
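The snippet above wraps every row in quotes. If only the first and the last record should be quoted, as in the question, one option is to pair each record with its index and decide per record. A minimal sketch of that per-record logic, shown on a plain Scala list so that it runs without Spark; quoteEnds is a hypothetical helper name:

```scala
object QuoteEnds {
  // Hypothetical helper: adds an opening quote to the record at index 0
  // and a closing quote to the record at lastIdx. A single-record input
  // (idx == 0 == lastIdx) gets both quotes.
  def quoteEnds(line: String, idx: Long, lastIdx: Long): String = {
    val open = if (idx == 0L) "\"" else ""
    val close = if (idx == lastIdx) "\"" else ""
    open + line + close
  }

  def main(args: Array[String]): Unit = {
    // Sample lines standing in for the records of the RDD.
    val lines = List(
      "A long time ago in a galaxy far, far away...",
      "...and the saga continues. The end."
    )
    val lastIdx = lines.size - 1L
    lines.zipWithIndex
      .map { case (line, i) => quoteEnds(line, i.toLong, lastIdx) }
      .foreach(println)
  }
}
```

On an RDD the wiring would look similar: rdd.zipWithIndex() yields (line, index) pairs with Long indices, and rdd.count() - 1 gives the last index, so the helper can be called inside a map. This reads the data twice (once for count, once for the map), so caching the RDD first may be worthwhile.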
If you can use a DataFrame:
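// concat, lit and col are not in scope by default, so import them explicitly
import org.apache.spark.sql.functions.{col, concat, lit}

val myDF = Seq(("Sentence1 something. Sentence2 something")).toDF("text")
val withQuotes = myDF.withColumn("textWithQuotes", concat(lit("\""), col("text"), lit("\"")))
scala> withQuotes.show(false)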
+----------------------------------------+------------------------------------------+
|text                                    |textWithQuotes                            |
+----------------------------------------+------------------------------------------+
|Sentence1 something. Sentence2 something|"Sentence1 something. Sentence2 something"|
+----------------------------------------+------------------------------------------+