Converting a PySpark DataFrame (pyspark.sql.dataframe.DataFrame) to CSV

Time: 2018-07-20 17:52:06

Tags: dataframe pyspark export-to-csv pyspark-sql

I have a transactions table like this:

transactions.show()

+---------+-----------------------+--------------------+
|person_id|collect_set(title_name)|          prediction|
+---------+-----------------------+--------------------+
|  3513736|   [Make or Break, S...|[Love In Island.....|
|  3516443|        [The Blacklist]|[Moordvrouw, The ...|
|  3537643|   [S4 - Dutch progr...|[Vamos met de Fam...|
|  3547688|   [Phileine Zegt So...|                  []|
|  3549345|   [The Wolf of Wall...|                  []|
|  3550565|   [Achtste Groepers...|                  []|
|  3553669|   [Mega Mindy: Reis...|                  []|
|  3558162|   [Snitch, Philomen...|                  []|
|  3561387|   [Automata, The Hi...|[Bella Donna's, M...|
|  3570126|   [The Wolf of Wall...|                  []|
|  3576602|   [Harry & Meghan: ...|[Weg van Jou, Moo...|
|  3586366|   [Gooische Vrouwen...|[Familieweekend, ...|
|  3586560|   [Hooligans 3: Nev...|                  []|
|  3590208|   [S2 - Dutch drama...|[Love In Island.....|
+---------+-----------------------+--------------------+

The table's schema looks like this:

transactions.printSchema()

root
 |-- person_id: long (nullable = false)
 |-- collect_set(title_name): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: string (containsNull = true)

Now I want to write this table to a CSV file and preserve the contents of every column. I tried the following:

transactions.repartition(1)\
.write.mode('overwrite')\
.save(path="//Users/King/Documents/my_final.csv", format='csv',sep=',',header = 'true')

However, I get the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-66-7473346bdbb1> in <module>()
----> 1 vl_assoc_rules_pred.repartition(1).write.mode('overwrite').save(path="s3a://ci-data-apps/rashid/vl-assoc-rules/vl_assoc_rules_pred.csv", format='csv',sep=',',header = 'true')

/usr/lib/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    593             self._jwrite.save()
    594         else:
--> 595             self._jwrite.save(path)
    596 
    597     @since(1.4)

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o840.save.
Can someone tell me how to write this table to CSV so that the contents of each column stay intact?

Thanks!

1 answer:

Answer 0: (score: 0)

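Assuming "transactions" is a Spark DataFrame that is small enough to collect onto the driver, you can convert it to a pandas DataFrame first (a PySpark DataFrame has no to_csv method of its own) and try the following:

transactions.toPandas().to_csv(file_name, sep=',')

to save it as CSV.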
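
With a local write like this, list-valued cells end up in the file as plain text, so it can help to choose an encoding that can be parsed back. The sketch below uses only the Python standard library and encodes each list as a JSON string before writing. The rows, the column names `titles` and `prediction`, and the helpers `encode_row` and `decode_row` are made up for illustration and are not taken from the original table.

```python
import csv
import io
import json

# Hypothetical rows; names and values are illustrative, not from the original table.
rows = [
    {"person_id": 1, "titles": ["Title A", "Title, with a comma"], "prediction": ["Title B"]},
    {"person_id": 2, "titles": ["Title C"], "prediction": []},
]
LIST_COLUMNS = ("titles", "prediction")


def encode_row(row):
    """Return a copy of the row with list-valued columns encoded as JSON text."""
    return {k: json.dumps(v) if k in LIST_COLUMNS else v for k, v in row.items()}


def decode_row(row):
    """Reverse encode_row for a row read back from CSV (every value arrives as str)."""
    decoded = {k: json.loads(v) if k in LIST_COLUMNS else v for k, v in row.items()}
    decoded["person_id"] = int(decoded["person_id"])
    return decoded


# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["person_id", "titles", "prediction"])
writer.writeheader()
writer.writerows(encode_row(r) for r in rows)

# Read the CSV text back and decode the JSON cells into lists again.
buffer.seek(0)
restored = [decode_row(r) for r in csv.DictReader(buffer)]
```

JSON is used here because it survives commas and quotes inside titles; the csv module quotes the resulting cell, so the list can be recovered exactly.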

Alternatively, you can use spark-csv:

Spark 1.3

df.save('mycsv.csv', 'com.databricks.spark.csv')

Spark 1.4+

df.write.format('com.databricks.spark.csv').save('mycsv.csv')

In Spark 2.0+, you can use the built-in csv data source directly:

df.write.csv('mycsv.csv')
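
The traceback in the question is cut off before the underlying Java exception, so the actual cause is not confirmed. One likely candidate is that Spark's built-in CSV data source does not support array-typed columns such as `collect_set(title_name)` and `prediction`. A minimal sketch, assuming that is the problem: join each array into one delimited string with `concat_ws` before writing. The helper names `array_column_names` and `write_arrays_as_strings` and the `"; "` separator are illustrative choices, not part of the original post.

```python
def array_column_names(dtypes):
    """Pick the array-typed columns out of a DataFrame.dtypes-style list.

    `dtypes` is a list of (column_name, type_string) pairs, for example
    ("prediction", "array<string>").
    """
    return [name for name, dtype in dtypes if dtype.startswith("array<")]


def write_arrays_as_strings(df, path, sep="; "):
    """Join every array column into one string, then write the result as CSV."""
    # Imported here so that array_column_names can be used without Spark installed.
    from pyspark.sql import functions as F

    out = df
    for name in array_column_names(df.dtypes):
        # Backticks keep names such as collect_set(title_name) from being
        # misread when the column reference is parsed.
        out = out.withColumn(name, F.concat_ws(sep, F.col("`" + name + "`")))

    # Assumption: a single output part file is wanted, as in the question.
    out.repartition(1).write.mode("overwrite").csv(path, header=True)
```

A call could look like `write_arrays_as_strings(transactions, "my_final_csv")`, where the path is a placeholder. Spark writes a directory of part files at that path rather than a single file, and the separator should be a string that never occurs inside a title, otherwise the arrays cannot be split back apart reliably.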