While looking for an alternative to SAS sorting, I decided to try Python 2.6 (both run on the same Unix server). In SAS, sorting a narrow 500 million row table takes 20 minutes. I exported 20% of the table (100 million rows) to a CSV file, which looks like this:
X|||465097434|912364420|0.00|0.00|0.00|0.00|1.00|01FEB2016|X|0|0
X|||465097434|912364420|0.00|0.00|0.00|0.00|0.00|02FEB2016|X|0|0
X|||465097434|912364420|0.00|0.00|0.00|0.00|2.00|03FEB2016|X|0|0
X|||465097434|912364421|0.00|0.00|0.00|0.00|3.00|04FEB2016|X|0|0
X|||465097434|912364421|0.00|0.00|0.00|0.00|6.00|05FEB2016|X|0|0
X|||965097411|912364455|0.00|0.00|0.00|0.00|4.00|04FEB2016|X|0|0
X|||965097411|912364455|0.00|0.00|0.00|0.00|1.00|05FEB2016|X|0|0
The goal is to sort by the 5th and 11th columns. First, I checked how fast Python can read the file, using this code:
from __future__ import print_function
import csv
import time

linesRead = 0
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    allLines = csv.DictReader(dailyFile, delimiter='|')
    startTime = time.time()
    for row in allLines:
        linesRead += 1
        if linesRead % 1000000 == 0:
            print(linesRead, ": ", time.time() - startTime, " sec.")
            startTime = time.time()
The result was roughly 6 seconds per million rows read:
1000000 : 6.6301009655 sec.
2000000 : 6.33900094032 sec.
3000000 : 6.26246404648 sec.
4000000 : 6.56919789314 sec.
5000000 : 6.17433309555 sec.
...
98000000 : 6.61627292633 sec.
99000000 : 7.14683485031 sec.
100000000 : 7.08069109917 sec.
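For comparison, csv.DictReader builds a dictionary object for every row, which adds per-row overhead; a minimal sketch of the same timing loop using the plain csv.reader (assuming the same file path and delimiter) would be:

from __future__ import print_function
import csv
import time

linesRead = 0
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    # csv.reader yields each row as a plain list of fields, with no per-row dict construction
    allLines = csv.reader(dailyFile, delimiter='|')
    startTime = time.time()
    for row in allLines:
        linesRead += 1
        if linesRead % 1000000 == 0:
            print(linesRead, ": ", time.time() - startTime, " sec.")
            startTime = time.time()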
So I extended the code to load the rows into a dictionary, keyed by the 5th column (the account identifier), with the value being a list of lists (one per row for that account). That is when I realized that loading into the dictionary slows down as the dictionary grows (which seemed perfectly logical, since there are more and more keys to check):
import csv
import time

myDictionary = {}
linesRead = 0
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    allLines = csv.DictReader(dailyFile, delimiter='|')
    startTime = time.time()
    for row in allLines:
        accountID = row['account_id'].strip('\'')
        linesRead += 1
        if accountID in myDictionary:
            myDictionary[accountID].append([row['date'].strip('\''), row['balance1'], row['balance2'], row['balance3']])
        else:
            # start the account's list with this first row
            myDictionary[accountID] = [[row['date'].strip('\''), row['balance1'], row['balance2'], row['balance3']]]
        if linesRead % 1000000 == 0:
            print(linesRead, ": ", time.time() - startTime, " sec.")
            startTime = time.time()
The times were:
(1000000, ': ', 8.9685721397399902, ' sec.')
(2000000, ': ', 10.344831943511963, ' sec.')
(3000000, ': ', 11.637137889862061, ' sec.')
(4000000, ': ', 13.024128913879395, ' sec.')
(5000000, ': ', 13.508150815963745, ' sec.')
(6000000, ': ', 14.94166088104248, ' sec.')
(7000000, ': ', 16.307464122772217, ' sec.')
(8000000, ': ', 17.130259990692139, ' sec.')
(9000000, ': ', 17.54616379737854, ' sec.')
(10000000, ': ', 20.254321813583374, ' sec.')
...
(39000000, ': ', 55.350741863250732, ' sec.')
(40000000, ': ', 56.762171983718872, ' sec.')
(41000000, ': ', 57.876702070236206, ' sec.')
(42000000, ': ', 54.548398017883301, ' sec.')
(43000000, ': ', 60.040227890014648, ' sec.')
This means there is no chance of loading 500 million rows in a reasonable time (the last million of the 500 would take around 600 seconds to load). My guess was that the slowest part of each iteration is checking whether the key already exists in the dictionary:
if accountID in myDictionary:
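One way to sidestep that explicit membership test is collections.defaultdict, which creates the empty list on first access; a minimal sketch of the loading loop, assuming the same column names as above, would be:

from collections import defaultdict
import csv

myDictionary = defaultdict(list)  # a missing key automatically maps to a new empty list
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    allLines = csv.DictReader(dailyFile, delimiter='|')
    for row in allLines:
        # no 'in' check needed: the first access for an account creates its list
        myDictionary[row['account_id'].strip('\'')].append(
            [row['date'].strip('\''), row['balance1'], row['balance2'], row['balance3']])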
So I changed the dictionary to a list, hoping that a plain append would be faster:
from __future__ import print_function
import csv
import time

myList = []
linesRead = 0
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    allLines = csv.DictReader(dailyFile, delimiter='|')
    startTime = time.time()
    for row in allLines:
        linesRead += 1
        myList.append([row['account_id'].strip('\''), row['date'].strip('\''), row['balance1'], row['balance2'], row['balance3']])
        if linesRead % 1000000 == 0:
            print(linesRead, ": ", time.time() - startTime, " sec.")
            startTime = time.time()
Unfortunately, the performance did not improve:
1000000 : 9.15476489067 sec.
2000000 : 10.3512279987 sec.
3000000 : 12.2600080967 sec.
4000000 : 13.5473120213 sec.
5000000 : 14.8431830406 sec.
6000000 : 16.5556428432 sec.
7000000 : 17.6754620075 sec.
8000000 : 19.1299819946 sec.
9000000 : 19.7615978718 sec.
10000000 : 22.5903761387 sec.
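One factor worth ruling out here (not something tried above) is CPython's cyclic garbage collector: it runs periodically as container objects are allocated, and with tens of millions of small lists alive each collection gets slower, which can produce exactly this kind of creeping load time. A rough sketch of that experiment, disabling collection only for the bulk load, would be:

from __future__ import print_function
import csv
import gc
import time

gc.disable()  # skip cyclic garbage collection while bulk-loading; re-enable afterwards
myList = []
startTime = time.time()
with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    for row in csv.reader(dailyFile, delimiter='|'):
        myList.append(row)
gc.enable()
print(len(myList), "rows loaded in", time.time() - startTime, "sec.")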
Shouldn't loading into a list be much faster than loading into a dictionary with a key check on every insert?
Am I misusing Python for this kind of data? For comparison, I sorted the file with the Unix sort command:
$ date ; sort -t'|' -k5,9 CSV_FILE.csv > delete.txt; date;
Sun Jul 23 18:46:16 CEST 2017
Sun Jul 23 19:06:53 CEST 2017
It took 20 minutes to do the job, whereas in Python I could not even load the data into memory.
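For completeness, if the rows do fit in memory as plain lists, sorting them by the 5th and 11th columns is a one-liner with operator.itemgetter; a minimal sketch (memory permitting, and assuming the rows come straight from csv.reader) would be:

import csv
import operator

with open('/path/to/file/CSV_FILE.csv', 'r') as dailyFile:
    rows = list(csv.reader(dailyFile, delimiter='|'))  # each row is the full list of 14 fields

# columns are 0-indexed here, so the 5th and 11th columns are indices 4 and 10
rows.sort(key=operator.itemgetter(4, 10))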
Answer 0 (score: 1)
I would suggest pandas, since it should be faster. This would be the code to read the csv file:
import pandas as pd
df = pd.read_csv('/path/to/file/CSV_FILE.csv', sep='|')
To sort it, you can use:
df.sort_values([4, 10], ascending=[True,True], inplace=True)
Note: the first list contains the column labels; the other arguments are self-explanatory.
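Since the sample file shown above has no header row, reading it by position may additionally need header=None, so that the columns are labelled 0 through 13 and can be addressed as 4 and 10; a minimal end-to-end sketch under that assumption (the output path is illustrative) would be:

import pandas as pd

# header=None keeps the first data row as data and labels the columns 0..13
df = pd.read_csv('/path/to/file/CSV_FILE.csv', sep='|', header=None)

# sort by the 5th and 11th columns (0-indexed labels 4 and 10)
df = df.sort_values([4, 10], ascending=[True, True])

# write the result back out in the same pipe-delimited format
df.to_csv('/path/to/file/CSV_FILE_sorted.csv', sep='|', header=False, index=False)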