Question

I'm trying to replace some characters in Hive output so that Pandas can read it correctly as a DataFrame.

The first thing I tried was:

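import gzip

# Read the whole decompressed file into memory
f2 = gzip.open(local_path, 'rb')
table = f2.read()
f2.close()

# Swap Hive's field delimiter and NULL marker
table = table.replace('\x01', '\t')
table = table.replace('\\N', 'NULL')

f = gzip.open(local_path, 'wb')
f.write(table)  # <----- ERROR
f.close()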

But this failed at the line marked above with "OverflowError: size does not fit in an int". My next idea was to do this:

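import gzip
import os

input_file = gzip.open(local_path, 'rb')
output_file = gzip.open(output_path, 'wb')
# Process one line at a time so the whole file never sits in memory
for line in input_file:
    line = line.replace('\x01', '\t')
    line = line.replace('\\N', 'NULL')
    output_file.write(line)
output_file.close()
input_file.close()
# Replace the original file with the cleaned copy
os.rename(output_path, local_path)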

But I'm worried it will be slow. Is there a better way to do this?

In case it is relevant to the solution, the goal is to be able to call:

result = pd.read_table(local_path, compression='gzip')

Pandas has a really hard time handling the Hive output characters, so the replacement needs to be done explicitly first.

Answer 1

If you specify both na_values and the separator, pandas will actually handle the Hive output format directly:

df = pd.read_table(local_path, compression='gzip', na_values='\\N', sep='\x01')
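
As a rough illustration of the call above, here is a minimal sketch that builds a tiny gzip file in Hive's default text layout and reads it back. The sample rows, the column names and the temporary path are made up for the example, and it uses pd.read_csv with an explicit sep, which should behave the same as read_table here.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample: three rows with \x01 as the field delimiter
# and \N standing in for NULL, as in the question.
sample = b"1\x01alice\x013.5\n2\x01\\N\x014.0\n3\x01carol\x01\\N\n"

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "hive_output.gz")
with gzip.open(sample_path, "wb") as fh:
    fh.write(sample)

# No character replacement is needed: the delimiter and the NULL
# marker are passed straight to the parser.
df = pd.read_csv(
    sample_path,
    compression="gzip",
    sep="\x01",
    na_values="\\N",
    header=None,                      # assumed: the file has no header row
    names=["id", "name", "score"],    # illustrative column names
)
print(df)
```

Because the file is parsed as it is decompressed, this avoids rewriting the gzip file at all.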

The only possible problem is saving it in a compressed format. The standard way would be a pickle:

df.to_pickle(output_path)

If you run into the problem described in "Pickling a DataFrame", you'll have to save it as a large file instead:

df.to_csv(output_path)
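
If the size of the plain CSV is a concern, one option is to let pandas compress it on write. The sketch below assumes a pandas version whose to_csv accepts compression='gzip' for file paths; the small frame and the temporary path are placeholders for illustration only.

```python
import os
import tempfile

import pandas as pd

# Placeholder data standing in for the real DataFrame
df_small = pd.DataFrame({"id": [1, 2], "name": ["alice", None]})

out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "table.csv.gz")

# Write a gzip-compressed CSV; missing values are written as empty fields
df_small.to_csv(csv_path, index=False, compression="gzip")

# Read it back to confirm the round trip
round_trip = pd.read_csv(csv_path, compression="gzip")
print(round_trip)
```

The output is an ordinary tab-free CSV, so later reads no longer need the sep or na_values arguments.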

Replacing characters in a large gzip file
