Question

I have the following DataFrame:

+--------------------+
|                  _1|
+--------------------+
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
|{"entry": {"@type...|
+--------------------+
only showing top 20 rows

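Each row contains valid JSON. I would like to save this so that I end up with a file, ideally JSON, that is just a collection of these objects (the rows above). However, with what I am currently using, I get JSON objects like this: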

{"_1":"{"entry": {"@type...}

What I want is:

{"entry": {"@type...}
{"entry": {"@type...}
{"entry": {"@type...}

Answer 1

One of the simplest ways is to convert the DataFrame to an RDD and keep only the value of the _1 column:

rdd = df.rdd.map(lambda row: row._1)

You can then read the RDD of JSON strings back into a DataFrame and save it as JSON:

sqlContext.read.json(rdd).write.json('output path to json')

Or you can save the strings directly to a text file, one JSON object per line:

rdd.saveAsTextFile('path to text json file')
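
To see why dropping the column wrapper matters, here is a minimal plain-Python sketch (no Spark involved) of the two output shapes. The sample rows and their "@type" values are made up for illustration, since the real values are truncated above; only the column name _1 comes from the question.

```python
import json

# Hypothetical sample rows: each value of column `_1` is already a JSON string.
# The "@type" values are invented; the originals are truncated in the question.
rows = [
    {"_1": '{"entry": {"@type": "a"}}'},
    {"_1": '{"entry": {"@type": "b"}}'},
]

# Serializing the whole row keeps the column name and turns the JSON
# into an escaped string nested under "_1" (the unwanted shape).
wrapped = [json.dumps(row) for row in rows]

# Writing only the column value, one per line, gives the wanted shape.
unwrapped = "\n".join(row["_1"] for row in rows)

# Each line of the unwrapped output parses as a standalone JSON object.
parsed = [json.loads(line) for line in unwrapped.splitlines()]
```

This mirrors what the answer does: mapping each row to row._1 removes the wrapper before anything is written.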

I hope the answer is helpful.
