Question

I have a DataFrame with 320 rows. I converted it to NDJSON with pandas:

df.to_json('file.json', orient='records', lines=True)

But when I load the data back, I only get 200 rows.

with open('file.json') as f:
    print(len(f.readlines()))

gives 200

spark.read.json('file.json').count()

also gives 200

Only reloading it with pandas gives the correct number of rows:

pd.read_json('file.json', orient='records', lines=True)

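My dataset contains \n characters in some fields. When I load the records with plain Python or Spark, I would therefore expect at least as many rows as the original, not fewer.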
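As a quick sanity check on that expectation, here is a minimal sketch using only the standard library json module (not pandas itself). The helper name to_ndjson_line and the sample record are made up for illustration. It shows that a standards-compliant serializer escapes a newline inside a string value, so an embedded \n should not add physical lines to an NDJSON file.

```python
import json


def to_ndjson_line(record):
    """Serialize one record as a single NDJSON line (hypothetical helper)."""
    # json.dumps escapes control characters inside strings, so a real
    # newline in a value becomes the two characters backslash + n.
    return json.dumps(record)


# Sample record invented for this example: the value holds a real newline.
record = {"text": "first line\nsecond line"}
line = to_ndjson_line(record)

print(line)
print(len(line.splitlines()))
```

If that holds for the file in question, the embedded \n characters would not explain a changed line count on their own.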

What is wrong with the pandas.to_json method?

Answer 1

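I checked the JSON file manually, line by line, and it looks like pandas.to_json wrote it incorrectly (or I have misunderstood the spec). Some records appear to be joined by '},{' on a single line instead of being separated by a newline.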

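# Read the whole file and put each record on its own line.
# Caveat: this also rewrites any '},{' that occurs inside a string value.
with open('file.json') as f:
    j = f.read().replace('},{', '}\n{')
with open('file.jsonl', 'w') as f:
    f.write(j)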

Replacing the faulty '},{' separators with newlines solves the problem.
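A plain string replacement can also alter a field whose text happens to contain '},{'. If that is a concern, a parser-based split is one alternative. The following is only a sketch, assuming the file consists of JSON objects separated by commas, newlines, or nothing at all. The function names split_concatenated_json and rewrite_as_ndjson are hypothetical, and it relies on json.JSONDecoder.raw_decode from the standard library.

```python
import json


def split_concatenated_json(text):
    """Yield each top-level JSON value in text, skipping separators."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # Skip whitespace, line breaks, and stray commas between values.
        while pos < end and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= end:
            break
        # raw_decode parses one value and returns it with the end index.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def rewrite_as_ndjson(text):
    """Return text re-serialized with exactly one JSON value per line."""
    return "".join(
        json.dumps(obj, ensure_ascii=False) + "\n"
        for obj in split_concatenated_json(text)
    )


# Made-up sample: two records joined by '},{', one of which contains
# '},{' inside a string, followed by a record with an escaped newline.
sample = '{"a": "x},{y"},{"a": 2}\n{"a": "line\\nbreak"}'
fixed = rewrite_as_ndjson(sample)
print(fixed)
```

To apply it to the real data, read 'file.json' in full, pass the text to rewrite_as_ndjson, and write the result to 'file.jsonl'. The line count of the output should then equal the number of records.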