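How can I reduce the size of a txt file created by Python?

Time: 2015-09-23 14:30:18

Tags: python pandas io pyodbc netezza

I have a table on a Netezza server with about 2M rows x 70 columns of numeric and categorical data, and I want to dump it into a .txt file using Python. I have done this before with SAS, and in my test case I get a txt file of about 450MB. With Python, I tried a couple of things.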

# One line at a time
import datetime
import pyodbc

startTime = datetime.datetime.now().replace(microsecond=0)

cnxn = pyodbc.connect('DSN=NZ_LAB')
cursor = cnxn.cursor()
c = cursor.execute("""SELECT * FROM MYTABLE""")

with open('dump_test_pyodbc.csv','wb') as csv:
    csv.write(','.join([g[0] for g in c.description])+'\n')
    while 1:
        a=c.fetchone()
        if not a:
            break
        csv.write(','.join([str(g) for g in a])+'\n')
cnxn.close()

endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PYODBC:", endTime - startTime

>>Time elapsed PYODBC: 0:18:20



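# Use Pandas chunksize
import datetime
import pyodbc
import pandas.io.sql as psql  # assumed: the usual alias for pandas' SQL helpers

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')

sql = ("""SELECT * FROM MYTABLE""")

# With chunksize set, read_sql returns an iterator of DataFrames
df = psql.read_sql(sql, cnxn, chunksize=1000)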

for k, chunk in enumerate(df):
    if k == 0:
        chunk.to_csv('dump_chunk.csv',index=False,mode='w')
    else:
        chunk.to_csv('dump_chunk.csv',index=False,mode='a',header=False)

endTime = datetime.datetime.now().replace(microsecond=0)
print "Time elapsed PANDAS:", endTime - startTime
cnxn.close()

>>Time elapsed PANDAS: 0:29:29

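Now the size: the Pandas approach created a 690MB file, and the other approach created a 630MB file. Both speed and size favor the line-at-a-time pyodbc approach, but size-wise it is still much larger than the original SAS output. Any ideas on how to improve the Python approach to reduce the output size?

Edit: adding examples--------------------

OK, it seems SAS does a better job of handling integers, which makes sense. I think this is what accounts for the size difference.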

SAS: XXXXXX,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.49,40.65,63.31,1249.92...

Pandas: XXXXXX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.49,40.65,63.31,1249.92...

fetchone(): XXXXXX,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.49,40.65,63.31,1249.92...
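
To get a rough sense of how much the trailing decimals cost, the sketch below measures the byte length of one row written in each of the three styles. The row is illustrative only: a 6-character placeholder key, 20 zero-valued columns, and the four non-zero values from the samples above. The real table has about 70 columns, so the actual ratios will differ. It is written for Python 3, and `row_bytes` is a hypothetical helper name.

```python
# Illustrative row layout (an assumption, not the real table):
# a key, 20 zero-valued columns, then four non-zero values.
KEY = "XXXXXX"
TAIL = ["2.49", "40.65", "63.31", "1249.92"]
N_ZEROS = 20


def row_bytes(zero_repr):
    """Return the byte length of one CSV line where each zero is written as zero_repr."""
    line = ",".join([KEY] + [zero_repr] * N_ZEROS + TAIL) + "\n"
    return len(line.encode("ascii"))


sizes = {
    "SAS-style '0'": row_bytes("0"),
    "pandas-style '0.0'": row_bytes("0.0"),
    "fetchone-style '0.00'": row_bytes("0.00"),
}

for label, size in sizes.items():
    print(label, size, "bytes")
```

When most columns hold zeros, the extra ".0" or ".00" on every field adds up quickly, which is in line with the gap between the 450MB and 630-690MB files.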

Edit 2: solution------------------------------------

I ended up removing the unnecessary decimals:

csv.write(','.join([str(g.strip()) if type(g)==str else '%g'%(g) for g in a])+'\n')

This brought the file size down to the SAS level.
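
For readers who want the one-liner unpacked, here is a sketch of the same idea as a small helper. It rests on a few assumptions: it targets Python 3, `format_field` is a hypothetical name, integers are written with `str` rather than `%g`, and `None` is written as an empty field (the original line handles neither case this way). One caveat: `%g` keeps only 6 significant digits by default, so longer values are rounded or switch to exponent notation. Passing a higher precision avoids that at the cost of a few more bytes.

```python
from decimal import Decimal


def format_field(value, precision=6):
    """Format one field compactly: strip strings, drop trailing zeros from numbers."""
    if value is None:
        return ""  # assumption: write NULLs as empty fields
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, int):
        return str(value)  # integers need no decimal handling
    if isinstance(value, (float, Decimal)):
        # %g removes trailing zeros but keeps only `precision` significant digits
        return "%.*g" % (precision, value)
    return str(value)


row = ("XXXXXX  ", Decimal("0.00"), 0, 0.0, Decimal("2.49"), 1249.92, None)
line = ",".join(format_field(v) for v in row)
print(line)

# Precision caveat: 6 significant digits is not enough for larger values.
print(format_field(1234567.89))      # rounded, exponent notation
print(format_field(1234567.89, 15))  # full value preserved
```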

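1 Answer:

Answer 0: (score: 0)

I was going to post this as a comment, but the text formatting helps.

My guess is that you are running into a quoted versus unquoted CSV issue. SAS has an option to create CSV files without quotes. Here is an example: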

This Value,That Value,3,Other Value,423,985.32

I think the file you are getting is more accurate and will not cause problems for fields that contain commas. The same row, quoted:

"This Value","That Value","3","Other Value","423,985.32"

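As you can see, in the first (SAS) example, if the row were read into a spreadsheet, the last field would be read as two different values, "423" and "985.32". In the second example, it is clear that it is actually a single value, "423,985.32". That is why the quoted format you are getting now (if I am right) is more accurate and safer.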
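
One way to check whether quoting is what inflates a file is to write the same row both ways with Python's standard `csv` module. The sketch below assumes Python 3 and uses the sample row from this answer; `render` is a hypothetical helper name. `csv.QUOTE_MINIMAL` quotes only fields that contain the delimiter, the quote character, or a line break, while `csv.QUOTE_ALL` quotes every field.

```python
import csv
import io


def render(row, quoting):
    """Write one row to an in-memory buffer and return the resulting CSV text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    return buf.getvalue()


row = ["This Value", "That Value", 3, "Other Value", "423,985.32"]

minimal = render(row, csv.QUOTE_MINIMAL)
quoted = render(row, csv.QUOTE_ALL)

print(minimal, end="")
print(quoted, end="")
print(len(minimal), "vs", len(quoted), "characters")

# Reading the minimally quoted line back keeps "423,985.32" as one field.
parsed = next(csv.reader(io.StringIO(minimal)))
print(parsed)
```

Minimal quoting keeps the comma-containing field intact while adding quotes only where they are needed, so it tends to be a reasonable middle ground between size and safety. Note that a plain `','.join(...)`, as used in the question, writes no quotes at all, so any field containing a comma would be split when the file is read back.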