Convert files listed inside a file into another DataFrame or RDD

Time: 2019-09-03 22:49:02

Tags: scala apache-spark pyspark

I have a process that generates files every 5, 10, or 20 minutes. Another process then lists their absolute paths and saves them to a file once an hour. The structure is as follows:

logan@Everis-PC  ~/Datasets/dev/path > cat path1
/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_D200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_S200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_V200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_D200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_S200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_V200_20190809.DAT

My code is as follows:

val pathFile = "/home/logan/Datasets/dev/path"

sc.wholeTextFiles(pathFile).collect.foreach { hdfspartition =>
  val a = sc.parallelize(Seq(hdfspartition._2)).toDF
  a.show(false)
}

But I get a DataFrame where all the data ends up in a single row:

+--------------------------------------------------------------------------------+
|value                                                                           |
+--------------------------------------------------------------------------------+
|/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_D200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_S200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPo_20190801_001808_V200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_D200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_S200_20190809.DAT
/home/logan/Datasets/novum_dev/in/TasPr_20190801_001828_V200_20190809.DAT
|
+--------------------------------------------------------------------------------+
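The single row appears because sc.wholeTextFiles returns one (path, content) pair per file, with the entire file content as a single string. Splitting that string on line breaks before parallelizing gives one row per listed path; a minimal sketch, inside the same foreach as above:

// Each wholeTextFiles record is (fileName, entireContent);
// splitting the content on line breaks yields one row per path.
val pathsDf = sc.parallelize(hdfspartition._2.split("\\r?\\n").toSeq).toDF
pathsDf.show(false)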

Hi, I need to extract the contents of the files listed in "pathFile": "pathFile" holds files that are themselves lists of more files, and those .DAT files (/../../novum_dev/in/TasPo_20190801_001808_D200_20190809.DAT) contain the data to analyze. I tried converting the first result (from wholeTextFiles) into an array of strings and then into a single comma-separated string:

import org.apache.spark.sql.Row

sc.wholeTextFiles(pathFile).collect.foreach { hdfspartition =>
  // Join the listed paths into one comma-separated string;
  // sc.textFile accepts that as multiple inputs.
  val fa = hdfspartition._2.split("\\r?\\n")
  val fs = fa.mkString(",")
  // split takes a regex, so the pipe delimiter must be escaped.
  val cdr = sc.textFile(fs).map(line => line.split("\\|", -1))
    .map(x => Row.fromSeq(x))
}
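Inside that same loop, cdr is still only an RDD[Row]; turning it into a DataFrame needs an explicit schema. A minimal sketch, assuming every field is a string (the column names c0, c1, ... are made up):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build an all-string schema sized to the first row, then wrap cdr.
val nCols = cdr.first().size
val schema = StructType((0 until nCols).map(i => StructField(s"c$i", StringType)))
val cdrDf = spark.createDataFrame(cdr, schema)
cdrDf.show(false)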

1 Answer:

Answer 0 (score: 0):

You should probably use spark.read.format("text"):

import org.apache.spark.sql._

val spark = SparkSession.builder.getOrCreate()
val pathFile = "/home/logan/Datasets/dev/path"
// Each line of the path file becomes one row in the "value" column.
val dataset = spark.read.format("text").load(pathFile)

dataset.show()
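From there, the listed paths can be collected and handed back to the reader to reach the .DAT contents. A minimal sketch, assuming the .DAT files are pipe-delimited, as the question's split("\\|", -1) suggests:

// Pull the listed .DAT paths out of the single "value" column.
val paths = dataset.collect().map(_.getString(0))

// Read every listed file as pipe-separated text (all columns as strings).
val cdrDf = spark.read.option("sep", "|").csv(paths: _*)
cdrDf.show(false)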