I am getting the following error while reading a large 6 GB single-line JSON file:
Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.io.IOException: Too many bytes before newline: 2147483648
Spark is not reading the JSON file using newlines, since the entire 6 GB JSON file sits on a single line:
jf = sqlContext.read.json("jlrn2.json")
Configuration:
spark.driver.memory 20g
Answer 0 (score: 6)
Yes, you have more than 2147483648 bytes in a single line. You need to split it up.
Keep in mind that Spark expects each line to be a valid JSON document, not the file as a whole. Below is the relevant passage from the Spark SQL Programming Guide:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
So if your JSON document is in the form...

[
  { [record] },
  { [record] }
]

you'll want to change it to

{ [record] }
{ [record] }
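If the 6 GB file really is one big JSON array, it has to be rewritten as line-delimited JSON before Spark can read it. A minimal Python sketch, assuming the top level is an array and using a hypothetical output name jlrn2_lines.json (for a file this size a streaming parser such as ijson would be preferable to json.load, which reads everything into memory):

import json

# read the single big JSON array and write one self-contained object per line
with open("jlrn2.json") as src, open("jlrn2_lines.json", "w") as dst:
    for record in json.load(src):
        dst.write(json.dumps(record) + "\n")

The rewritten file can then be passed to sqlContext.read.json without hitting the 2 GB line limit, as long as no single record is that large on its own.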
Answer 1 (score: 0)
I stumbled on this while reading a huge JSON file in PySpark and hitting the same error. So, in case anyone else is wondering how to save a JSON file in a format PySpark can read correctly, here is a quick example using pandas:
import pandas as pd
from collections import defaultdict

# build some dict you want to dump
list_of_things_to_dump = [1, 2, 3, 4, 5]
dump_dict = defaultdict(list)
for number in list_of_things_to_dump:
    dump_dict["my_number"].append(number)

# save the data like this using pandas; it will work off the bat with PySpark
output_df = pd.DataFrame.from_dict(dump_dict)
with open('my_fancy_json.json', 'w') as f:
    f.write(output_df.to_json(orient='records', lines=True))
After that, loading the JSON in PySpark is as simple as:
# schema is an optional, pre-defined StructType; drop the argument to let Spark infer it
df = spark.read.json("hdfs:///user/best_user/my_fancy_json.json", schema=schema)
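With orient='records' and lines=True, the file written above contains one JSON object per line, which is exactly the layout Spark expects, e.g.:

{"my_number":1}
{"my_number":2}
{"my_number":3}
{"my_number":4}
{"my_number":5}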