I am trying to write to a CSV file from a function parallelized with PySpark. The problem I am running into is that the header gets appended once every time I write to the CSV.
I tried using fd.tell() to check whether any data has already been written to the file, so that the header is only written when the file is empty. fd.tell() works fine when I call the function in a plain loop, but not when it is parallelized with PySpark.
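To illustrate what I think is happening, here is a minimal reproduction outside Spark, using multiprocessing (the file name and data are made up): every worker that opens the file before the first write lands sees fd.tell() == 0, so each one writes the header.

import os
from multiprocessing import Pool

FILE_NAME = 'growth.csv'  # made-up output path

def write_chunk(i):
    with open(FILE_NAME, 'a') as fd:
        if fd.tell() == 0:  # races: several workers can see 0 at once
            fd.write('entity_id,value\n')
        fd.write('{},{}\n'.format(i, i * 10))

if __name__ == '__main__':
    if os.path.exists(FILE_NAME):
        os.remove(FILE_NAME)
    with Pool(4) as pool:
        pool.map(write_chunk, range(4))
    # growth.csv often ends up with the header repeated

This is my actual code: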
import pandas as pd
from functools import partial

def process(data, func):
    # One partition per group, so each group is handled by its own task
    sc = SparkConnection().spark_context()
    comps = sc.parallelize(data, len(data))
    output = comps.map(func)
    output.collect()
    sc.stop()
def compute_growth_stats(group, file_name):
    # group is an (entity_id, DataFrame) tuple produced by groupby
    entity_id, data = group[0], group[1]
    with open(file_name, 'a') as fd:
        for i in range(2, 6):
            data['window'] = i
            data['shifted'] = data.value.shift(i)
            data['growth'] = data.value - data.shifted
            # Only write the header if nothing has been written yet
            data.to_csv(fd, index=False, header=fd.tell() == 0)
def dump_growth_stats(data, file_name):
    data['date'] = pd.to_datetime(data['date'])
    grouped = list(data.groupby('entity_id'))
    # Bind file_name by keyword so each (entity_id, DataFrame) tuple
    # is still passed as the first positional argument
    func = partial(GrowthStats.compute_growth_stats, file_name=file_name)
    process(grouped, func)
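One workaround I am considering (a rough, untested sketch; the part-file naming and the merge helper are my own invention) is to have each task write a header-less part file and then merge them on the driver, so the header is written exactly once:

import glob
import os

def compute_growth_stats_part(group, out_dir):
    # Each task writes its own part file, with no header and no shared handle
    entity_id, data = group[0], group[1]
    part_path = os.path.join(out_dir, 'part-{}.csv'.format(entity_id))
    data.to_csv(part_path, index=False, header=False)

def merge_parts(out_dir, file_name, columns):
    # Runs once on the driver after collect(), so there is no race
    with open(file_name, 'w') as out:
        out.write(','.join(columns) + '\n')  # header written exactly once
        for part in sorted(glob.glob(os.path.join(out_dir, 'part-*.csv'))):
            with open(part) as fd:
                out.write(fd.read())

Is there a cleaner way to do this, or a way to make the fd.tell() check safe under PySpark?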