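I am converting raw records that arrive in a single zlib-compressed file into enriched Parquet records for later processing in Spark. I don't control the zlib file, and the Parquet output needs to be consistent with our other processing. I'm working in PySpark on Spark 2.3. My approach works unless the zlib file is large (~300 MB). The data fits in memory without trouble, but Spark runs out of memory. If I raise the driver memory to 8g, it works. It feels like a memory leak caused by the function calls shown below.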
from typing import List

import pyspark
from pyspark.sql.types import StringType, StructField, StructType


def process_file(gzfile, spark, chunk_size=2000000):
    # load the data and decompress it
    data = load_original_data(gzfile)
    if len(data) == 0:
        raise ValueError(f"No records loaded from file {gzfile}")
    # step through the data one chunk at a time (no empty trailing chunk)
    for offset in range(0, len(data), chunk_size):
        # convert the chunk into a spark dataframe
        df = raw_to_spark(data[offset:offset + chunk_size], spark)
        # enrich the data while in a spark dataframe w/ more columns
        df = extract_fields_from_raw(df)
        save_to_parquet(df, parquet_output_path)


def raw_to_spark(events: List[str], spark: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame:
    """
    convert the list of raw strings into a spark dataframe so we can do all the processing. this list is large in
    memory so we pop one list while building the new one. then throw the new list into spark.
    """
    schema = StructType([StructField("event", StringType())])
    rows = []  # make the list smaller as we create the row list for the dataframe
    while events:
        event = events.pop()
        if event.count(",") >= 6:  # make sure there are 7 fields at least
            rows.append((event,))  # presumed body: the append is missing from the post
    rdd = spark.sparkContext.parallelize(rows, numSlices=2000)  # we need to partition in order to pass to workers
    return spark.createDataFrame(rdd, schema=schema)


def extract_fields_from_raw(df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    """this adds columns to the dataframe for enrichment before saving"""
    ...  # enrichment logic not shown

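So, as I loop through the decompressed data, Spark DataFrames of at most 2M records each are being created. Each of these DataFrames should fit in the 4g driver space without any problem. I get the same error if I use, for example, chunks of 1M. The Spark logs show that the out-of-memory error occurs at the same memory consumption each time, e.g. 4.5 GB of memory used just before the failure.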
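
One detail of the code above that may be worth checking: slicing a Python list creates a new list, so the `events.pop()` loop in `raw_to_spark` empties only the slice it was handed, not the full `data` list held by `process_file`. The sketch below shows this with plain Python and no Spark. `drain` is a hypothetical stand-in for the popping loop, and the sample records are made up for illustration. It is not a diagnosis of the out-of-memory error, only a check on whether popping frees what the docstring expects it to free.

```python
from typing import List


def drain(events: List[str]) -> List[str]:
    """Hypothetical stand-in for the pop loop in raw_to_spark."""
    rows = []
    while events:
        rows.append(events.pop())  # shrinks `events` as `rows` grows
    return rows


# Made-up sample records with 7 comma-separated fields each.
data = [f"a,b,c,d,e,f,{i}" for i in range(10)]

chunk = data[0:4]    # slicing builds a new list object (a shallow copy)
rows = drain(chunk)  # pops from the copy only

print(len(chunk))  # the slice has been emptied
print(len(data))   # the original list still holds every record
```

The slice is a shallow copy, so the strings themselves are shared rather than duplicated, but the full `data` list stays alive for the whole loop regardless of how many chunks have been processed.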