How to write the resulting RDD to a CSV file in Spark Python

Date: 2015-08-08 21:53:51

Tags: python csv apache-spark pyspark file-writing

I have a result RDD, labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions), whose output looks like this:

[(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....]

What I want is to create a CSV file with one column for the labels (the first element of each tuple in the output above) and a second column for the predictions (the second element of each tuple). But I don't know how to write to a CSV file in Spark using Python.

How can I create a CSV file from the output above?

4 Answers:

Answer 0 (score: 33):

Just map the rows of the RDD (labelsAndPredictions) into strings (the lines of the CSV), then use rdd.saveAsTextFile():

def toCSVLine(data):
    # Join every field of the tuple into one comma-separated line
    return ','.join(str(d) for d in data)

lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')
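
One caveat: saveAsTextFile() creates a directory of part-files at that path, not a single CSV file. If the result is small enough to fit on one worker, a minimal sketch (reusing the same lines RDD and path) is to coalesce to one partition first, so the output directory contains a single part-file:

# Sketch: one partition -> one part-file inside the output directory
# (only safe for small results, since all data moves to a single worker)
lines.coalesce(1).saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')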

Answer 1 (score: 17):

I know this is an old post, but to help anyone searching for the same thing: here is how I wrote an RDD to a single CSV file in PySpark 1.6.2.

RDD:

>>> rdd.take(5)
[(73342, u'cells'), (62861, u'cell'), (61714, u'studies'), (61377, u'aim'), (60168, u'clinical')]

Now the code:

# First convert the RDD to a DataFrame
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc: the existing SparkContext (provided in the PySpark shell)
df = sqlContext.createDataFrame(rdd, ['count', 'word'])

DF:

>>> df.show()
+-----+-----------+
|count|       word|
+-----+-----------+
|73342|      cells|
|62861|       cell|
|61714|    studies|
|61377|        aim|
|60168|   clinical|
|59275|          2|
|59221|          1|
|58274|       data|
|58087|development|
|56579|     cancer|
|50243|    disease|
|49817|   provided|
|49216|   specific|
|48857|     health|
|48536|      study|
|47827|    project|
|45573|description|
|45455|  applicant|
|44739|    program|
|44522|   patients|
+-----+-----------+
only showing top 20 rows

Now write the CSV:

# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')
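
Note that com.databricks.spark.csv comes from the external spark-csv package, which was needed on Spark 1.x. From Spark 2.0 onward the CSV writer is built in, so a roughly equivalent sketch (same illustrative path) would be:

# Spark 2.x+ sketch: the CSV data source is built in, no external package needed
df.coalesce(1).write.csv('file:///home/username/csv_out', header=True)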

P.S.: I am just a beginner learning from posts here on Stack Overflow, so I don't know whether this is the best approach. But it works for me, and I hope it helps someone!

Answer 2 (score: 9):

It's not good to just join by commas because if fields contain commas, they won't be properly quoted, e.g. ','.join(['a', 'b', '1,2,3', 'c']) gives you a,b,1,2,3,c when you'd want a,b,"1,2,3",c. Instead, you should use Python's csv module to convert each list in the RDD to a properly-formatted csv string:

# python 3
import csv, io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip() # remove extra newline

# ... do stuff with your rdd ...
rdd = rdd.map(list_to_csv_str)
rdd.saveAsTextFile("output_directory")

Since the csv module only writes to file objects, we have to create an empty "file" with io.StringIO("") and tell the csv.writer to write the csv-formatted string into it. Then, we use output.getvalue() to get the string we just wrote to the "file". To make this code work with Python 2, just replace io with the StringIO module.
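
Following that note, a Python 2 version of the same helper would look roughly like this (same logic, with StringIO swapped in for io):

# python 2 sketch of the same helper
import csv
import StringIO

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = StringIO.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip()  # remove extra newline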

If you're using the Spark DataFrames API, you can also look into the DataBricks save function, which has a csv format.
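
For example, with the spark-csv package loaded on Spark 1.x (the output path here is illustrative), a sketch would be:

# Sketch using the spark-csv package on Spark 1.x
df.write.format('com.databricks.spark.csv').option('header', 'true').save('output_directory_csv')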

Answer 3 (score: 0):

def toCSV(row):
    # Join the fields of one row into a comma-separated line
    return ','.join(str(element) for element in row)

rows_of_csv = RDD.map(toCSV)

# choose your path based on your distributed file system
rows_of_csv.saveAsTextFile('/FileStore/tables/name_of_csv_file.csv')