How do I combine multiple CSV files of doubles into one RDD with the file names?

Time: 2017-07-20 15:10:24

Tags: apache-spark pyspark correlation

I have 3 sets of CSV files, which are basically lists of double values (one double per line) and are split per month:

A: aJan.csv, aFeb.csv, aMarch.csv
B: bJan.csv, bFeb.csv, bMarch.csv
C: cJan.csv, cFeb.csv, cMarch.csv
D: DJan.csv, DFeb.csv, DMarch.csv

I want to calculate all Pearson correlation pairs over A, B, C and D. PySpark has a correlation method.

My question is: how do I create one RDD from the 3 files, i.e. aJan.csv, aFeb.csv, aMarch.csv, and then do the same for the other series? I know I could do something like what is mentioned in How to read multiple text files into a single RDD?, but I want a single view in month-appended order, i.e. the data from January comes first, then Feb.csv is appended, then March.csv.

2 answers:

Answer 0: (score: 0)

how do I make 1 RDD from 3 files

Please don't. Given your question, you seem to have just started your journey with Spark, and you are about to use the low-level RDD API, which is... not... for... you (sorry for the pauses, but I wanted to express how I feel about it).

If you insist...

I think you should start with the SparkContext.wholeTextFiles operator.

wholeTextFiles(path: String, minPartitions: Int = defaultMinPartitions): RDD[(String, String)]

Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

That gives you the content of each CSV file together with its path. Just transform the RDD as you need, and... you're done.
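
For illustration, a minimal sketch of that approach; the data/a/*.csv path, the month names in the file names and the monthIndex helper are assumptions made up for the example, not part of the original question:

import org.apache.spark.rdd.RDD

// Hypothetical helper: derive a sort key from the file name so Jan < Feb < March.
def monthIndex(path: String): Int =
  Seq("Jan", "Feb", "March").indexWhere(m => path.contains(m))

// One (filePath, fileContent) pair per file.
val files: RDD[(String, String)] = spark.sparkContext.wholeTextFiles("data/a/*.csv")

// Keep the file name with every value and order the files by month before
// flattening, so January's values come first, then February's, then March's.
val aValues: RDD[(String, Double)] = files
  .sortBy { case (path, _) => monthIndex(path) }
  .flatMap { case (path, content) =>
    content.split("\n").filter(_.trim.nonEmpty).map(line => (path, line.trim.toDouble))
  }

Collecting aValues then returns January's values first, then February's, then March's, which is the month-appended view asked for.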

Please consider using Spark SQL's Dataset API instead, which gives you spark.read.csv, orderBy and others. Do yourself a favour.
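
A rough sketch of what that route could look like (the file layout under data/, the monthIdx column and the final corr call below are my own illustrative assumptions):

import org.apache.spark.sql.functions.{input_file_name, lit}

val months = Seq("Jan", "Feb", "March")   // assumed ordering key

// Read one series (A here) keeping the source file name, and tag each part
// with a month index so the parts can be ordered January -> March.
val a = months.zipWithIndex.map { case (m, i) =>
  spark.read.csv(s"data/a$m.csv")                    // assumed path layout
    .withColumn("file", input_file_name())
    .withColumn("monthIdx", lit(i))
}.reduce(_ union _)
 .orderBy("monthIdx")

Once all four series sit side by side in one DataFrame (one column per series), the Pearson pairs the question asks about can be computed with df.stat.corr("a", "b") and so on, which defaults to Pearson.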

Answer 1: (score: 0)

I propose the following approach:

First, obtain a parallel Set of your initial data (for optimized scheduling, in case you do anything else with the data before the union below), containing an explicit or implicit mapping of month -> file_for_month.csv

i.e.:

val files = Set(("January", "aJan.csv"), ("February", "aFeb.csv")).par

then you can generate a set of DataFrames like this:

import org.apache.spark.sql.functions.lit

val monthDfs = files.map(
                         month =>
                           spark.read.csv(month._2)                 // month._2 is the file path
                                .withColumn("month", lit(month._1)) // tag each row with its month
                        )

to combine them into a single DataFrame:

spark.createDataFrame(
    spark.sparkContext.union(
         monthDfs.map(_.rdd).seq.toSeq   // back to a sequential Seq for SparkContext.union
         ),
    monthDfs.head.schema)

This is a bit hacky, since it goes through .rdd. (I have had .rdd fail inexplicably at runtime before; I could fix it by assigning the result to a variable outside the scope of the final mapping. YMMV.)
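
A minimal sketch of that workaround as described above (the variable name and the .seq conversion are my own additions):

// Assign the per-month RDDs to their own val first, outside the union call,
// converting the parallel collection back to a plain sequential Seq.
val monthRdds = monthDfs.map(_.rdd).seq.toSeq

val combined = spark.createDataFrame(
  spark.sparkContext.union(monthRdds),
  monthDfs.head.schema)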

But, Voilà, you have a single DataFrame with a "month" column, containing all your data. If you're scared of .rdd (and you should be), and the number of files isn't in the tens of thousands, then you can also simply use something like this:

monthDfs.reduce((a, b) => a.union(b))

These operations are part of the execution graph though, and will grow it by the number of elements in files, eventually causing a slowdown or even crashes, observed somewhere in the ~1000-element range. See SPARK-15326 ("Not a Problem") and the non-linear analysis cost discussed there.