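I have a table on a Netezza server with roughly 2M rows x 70 columns of numeric and categorical data, and I want to dump it to a .txt file using Python. I have done this before with SAS, and for my test case I get a txt file of about 450 MB. With Python I have tried a couple of things.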
# One line at a time
import datetime
import pyodbc

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
cursor = cnxn.cursor()
c = cursor.execute("""SELECT * FROM MYTABLE""")
with open('dump_test_pyodbc.csv', 'w', newline='') as csv:
    # Header row, taken from the cursor's column metadata
    csv.write(','.join([g[0] for g in c.description]) + '\n')
    while True:
        a = c.fetchone()
        if not a:
            break
        csv.write(','.join([str(g) for g in a]) + '\n')
cnxn.close()
endTime = datetime.datetime.now().replace(microsecond=0)
print("Time elapsed PYODBC:", endTime - startTime)
>>Time elapsed PYODBC: 0:18:20
# Use Pandas chunksize
import datetime
import pandas.io.sql as psql
import pyodbc

startTime = datetime.datetime.now().replace(microsecond=0)
cnxn = pyodbc.connect('DSN=NZ_LAB')
sql = """SELECT * FROM MYTABLE"""
# With chunksize set, read_sql returns an iterator of DataFrames
df = psql.read_sql(sql, cnxn, chunksize=1000)
for k, chunk in enumerate(df):
    if k == 0:
        # First chunk creates the file and writes the header
        chunk.to_csv('dump_chunk.csv', index=False, mode='w')
    else:
        chunk.to_csv('dump_chunk.csv', index=False, mode='a', header=False)
endTime = datetime.datetime.now().replace(microsecond=0)
print("Time elapsed PANDAS:", endTime - startTime)
cnxn.close()
>>Time elapsed PANDAS: 0:29:29
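A middle ground between the two attempts above would be to fetch rows in batches with the DB-API `fetchmany()` call and let the standard `csv` module do the joining. The sketch below is only that, a sketch: `dump_cursor_to_csv` and `batch_size` are names made up for illustration, sqlite3 stands in for the Netezza connection so the example runs anywhere, and no timing against the real table is claimed.

```python
import csv
import os
import sqlite3
import tempfile


def dump_cursor_to_csv(cursor, path, batch_size=10000):
    """Write the result set of an already-executed DB-API cursor to a CSV file.

    Hypothetical helper: returns the number of data rows written.
    """
    rows_written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        # cursor.description has one entry per column; item 0 is the name.
        writer.writerow([column[0] for column in cursor.description])
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written


if __name__ == "__main__":
    # Assumption: an in-memory sqlite3 table replaces MYTABLE for the demo.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE mytable (id TEXT, a REAL, b REAL)")
    connection.executemany(
        "INSERT INTO mytable VALUES (?, ?, ?)",
        [("x1", 0.0, 2.49), ("x2", 0.0, 40.65)],
    )
    cursor = connection.execute("SELECT * FROM mytable")
    with tempfile.TemporaryDirectory() as folder:
        target = os.path.join(folder, "dump_demo.csv")
        print(dump_cursor_to_csv(cursor, target, batch_size=1))
    connection.close()
```

With pyodbc the same function should accept the cursor returned by `cursor.execute(...)`, since it only relies on `description` and `fetchmany()`. Note that this changes how rows are fetched and written, not how each value is rendered, so on its own it would not be expected to shrink the file.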
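Now for size: the pandas approach produced a 690 MB file, and the fetchone() approach a 630 MB file. Both speed and size seem to favor the first (fetchone) approach, but size-wise it is still much larger than the original SAS output. Any ideas on how to improve the Python approach to reduce the output size?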
EDIT: added examples --------------------
OK, it seems SAS does a better job of handling integers, which makes sense. I think that accounts for the size difference.
SAS: XXXXXX,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.49,40.65,63.31,1249.92...
pandas: XXXXXX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.49,40.65,63.31,1249.92...
fetchone(): XXXXXX,0.00,0.00,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,2.49,40.65,63.31,1249.92...
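To make the difference in the three sample rows concrete, here is a small sketch of how `str()` renders the same zero depending on its Python type. The mapping of types to tools is an assumption based on the output above (in particular, that the driver hands NUMERIC columns back as `decimal.Decimal` with their declared scale); `rendered_width` is a made-up helper name.

```python
from decimal import Decimal


def rendered_width(value):
    """Number of characters str() produces for one field."""
    return len(str(value))


# Hypothetical stand-ins for the same zero as each tool might represent it.
samples = {
    "integer (SAS-style output)": 0,
    "float64 (pandas)": 0.0,
    "Decimal with scale 2 (assumed driver type)": Decimal("0.00"),
}

for label, value in samples.items():
    print(f"{label}: {str(value)!r} takes {rendered_width(value)} character(s)")
```

With roughly 2M rows, every extra character in a column adds about 2 MB to the file, so two or three extra characters across many mostly-zero columns add up quickly.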
EDIT 2: solution ------------------------------------
I ended up stripping the unnecessary decimals:
csv.write(','.join([g.strip() if isinstance(g, str) else '%g' % g for g in a]) + '\n')
This brought the file size down to the SAS level.
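One caveat with `'%g'` is that it keeps only six significant digits by default and switches to scientific notation for larger values, and it raises on `None`. A possible alternative, sketched below under the assumption that fields arrive as `str`, `Decimal`, `float`, `int`, or `None`, strips trailing zeros without touching the significant digits. `compact_field` is a hypothetical name, not something from pyodbc or pandas.

```python
from decimal import Decimal


def compact_field(value):
    """Render one field as text, dropping trailing zeros from numbers."""
    if value is None:
        return ""  # assumption: NULLs become empty fields
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, Decimal):
        text = format(value, "f")  # fixed-point, never scientific notation
    elif isinstance(value, float):
        text = repr(value)  # shortest text that round-trips the float
    else:
        return str(value)
    # Only trim plain decimal notation; leave exponent forms untouched.
    if "." in text and "e" not in text.lower():
        text = text.rstrip("0").rstrip(".")
    return text


row = ["XXXXXX ", Decimal("0.00"), 0.0, Decimal("2.49"), 1234567.89, None]
print(",".join(compact_field(field) for field in row))
# XXXXXX,0,0,2.49,1234567.89,
print("%g" % 1234567.89)
# 1.23457e+06
```

In the fetchone() loop this would replace the list comprehension with `','.join(compact_field(g) for g in a)`.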
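Answer 0 (score: 0)
I was going to post this as a comment, but the text formatting helps.
My guess is that you are running into a quoted versus unquoted CSV file issue. SAS has an option to create CSV files without quotes. Here is an example: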
This Value,That Value,3,Other Value,423,985.32
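I think the file you are getting is the more accurate one, and it will not cause problems for fields that contain commas. The same line, quoted: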
"This Value","That Value","3","Other Value","423,985.32"
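As you can see, in the first (SAS) example, reading the file into a spreadsheet would produce two separate values, "423" and "985.32". In the second example it is clear that this is actually a single value, "423,985.32". That is why the quoted format you are getting now (if I am right) is more accurate and safer.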
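For what it is worth, Python's standard `csv` module lets you pick the quoting policy, so you can see both the safety and the size effect on the example row. This is just an illustration: the `render` helper is made up, and whether quoting is actually what inflates the file in the question is still a guess.

```python
import csv
import io

row = ["This Value", "That Value", 3, "Other Value", "423,985.32"]


def render(fields, quoting):
    """Return one CSV line for `fields` under the given quoting policy."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting, lineterminator="\n").writerow(fields)
    return buffer.getvalue()


minimal = render(row, csv.QUOTE_MINIMAL)  # quote only fields that need it
quoted = render(row, csv.QUOTE_ALL)       # quote every field

print(minimal, end="")
# This Value,That Value,3,Other Value,"423,985.32"
print(quoted, end="")
# "This Value","That Value","3","Other Value","423,985.32"
print(len(quoted) - len(minimal))
# 8
```

`QUOTE_ALL` costs two extra characters per field, while `QUOTE_MINIMAL` only pays that for fields containing the delimiter, a quote, or a line break, so it keeps the comma-containing value intact at almost no cost in size.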