Optimizing reads from .gz files and CPU utilization in Python

Date: 2018-03-05 12:39:57

Tags: python performance file cpu-usage python-2.6

Sample input file:

83,REGISTER,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL                                                                                             
83,INVITE,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,INVITE,0,10.166.224.34,1518814163,[sip:1202677@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202678@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL
83,REGISTER,0,10.166.224.34,1518814163,[sip:1202687@mobile.com],sip:1202977@mobile.com,3727925550,0600,NULL,NULL

Sample output file:

1202677 REGISTER,INVITE
1202687 INVITE,REGISTER
1202678 REGISTER

Code sample:

import glob
import gzip
import sys

filesList = glob.glob("%s/*.gz" % (sys.argv[1]))
events = {}  # subscriber number -> comma-separated SIP methods

for path in filesList:
    try:
        fp = gzip.open(path, 'rb')
        f = fp.readlines()  # loads the whole decompressed file into memory
        fp.close()
        for line in f:
            line = line.split(',')
            if line[0] == '83':
                # Extract the number between "[sip:" and "@" in field 5
                parts = line[5].split("[sip:")
                if len(parts) > 1:
                    parts = parts[1].split("@")
                key = parts[0].strip()
                if key in events:
                    events[key] = events[key] + ',' + line[1]
                else:
                    events[key] = line[1]
    except Exception:
        print "Unexpected Error: ", sys.exc_info()[0]

try:
    with open(sys.argv[2], 'w') as s:
        for num in events:
            print >> s, num, events[num]
except Exception:
    print "Unexpected error:", sys.exc_info()[0]

When I run the script above on a 2.1 GB load (430 files), it takes about 13 minutes to execute and CPU utilization is about 100%.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12586 root      20   0  156m 134m 1808 R 99.8  0.2   0:40.17 script

Please tell me how to optimize the code above to reduce the execution time. Thanks.

1 answer:

Answer 0 (score: 0)

Try pandas:

import pandas as pd

# files: the list of .gz paths, e.g. the result of glob.glob
df = pd.concat([pd.read_csv(f, header=None, usecols=[1, 5]) for f in files])
df[5] = df[5].str.split(':|@').apply(lambda x: x[1])
result = df.groupby(5)[1].apply(list)

# 5
# 1202677    [REGISTER, INVITE]
# 1202678            [REGISTER]
# 1202687    [INVITE, REGISTER]
# Name: 1, dtype: object

If this is still too slow, there are tools such as dask.dataframe that can improve efficiency.
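
For comparison, here is a minimal standard-library sketch of the same grouping, not a measured fix. It assumes Python 3 (the question targets Python 2.6, where gzip.open has no text mode) and uses hypothetical helper names, collect_events and write_report. It iterates each gzip file line by line instead of calling readlines(), stops splitting after the fields it needs, and appends to a list per number instead of rebuilding a string on every match. Unlike the question's code, it skips records whose sixth field has no "[sip:" prefix; that is an assumption about the intended behavior. If decompression itself dominates the CPU time, the gain may be small, so profiling first is worthwhile.

```python
import gzip
from collections import defaultdict


def collect_events(paths):
    """Map each subscriber number to the list of SIP methods seen for it."""
    events = defaultdict(list)
    for path in paths:
        # Text mode ("rt") yields str lines; iterating the file object
        # streams it rather than loading everything with readlines().
        with gzip.open(path, "rt") as fp:
            for line in fp:
                # Only the first six fields are needed, so cap the split.
                fields = line.split(",", 6)
                if len(fields) < 6 or fields[0] != "83":
                    continue
                _, sep, rest = fields[5].partition("[sip:")
                if not sep:
                    continue  # assumption: skip records without "[sip:"
                number = rest.split("@", 1)[0].strip()
                events[number].append(fields[1])
    return events


def write_report(events, out_path):
    """Write one 'number METHOD1,METHOD2' line per subscriber."""
    with open(out_path, "w") as out:
        for number, methods in events.items():
            out.write("%s %s\n" % (number, ",".join(methods)))
```

With the question's command-line arguments, this could be called as write_report(collect_events(glob.glob("%s/*.gz" % sys.argv[1])), sys.argv[2]).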