Python script takes too long to run?

Time: 2018-01-23 17:59:37

Tags: python pandas csv

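I am writing a Python script that basically does the following:

  1. Reads a CSV file into a DataFrame object.
  2. Selects some columns by name and stores them in a new DataFrame object.
  3. Performs some math and string operations on the cell values. I use a for loop with the iterrows() method here.
  4. Writes the modified DataFrame to a CSV file.
  5. Converts the CSV to JSON using a for loop.

This code takes forever to run. I am trying to understand why it takes so long, and whether I should approach the task differently to speed up execution.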

    import pandas
    import json
    import pendulum
    import csv
    import os
    import time
    
    start_time = time.time()
    print("--- %s seconds ---" % (time.time() - start_time))
    
    os.chdir('/home/csv_files_from_REC')
    df11 = pandas.read_csv('RTP_Gap_2018-01-21.csv') ### Reads the CSV FILE
    
    print(df11.shape) ### Prints the shape of the DF
    
    ### Filter the initial DF by selecting some columns based on NAME
    df1 = df11[['ENODEB','DAY','HR','SITE','RTP_Gap_Length_Total_sec','RTP_Session_Duration_Total_sec','RTP_Gap_Duration_Ratio_Avg%']]
    
    print(df1.shape) ## Prints Shape
    
    #### Math and String manipulation stuff ###
    for index, row in df1.iterrows():
        if row['DAY'] == 'Total':
            df1.drop(index, inplace=True)
        else:
            stamp = row['DAY'] + ' ' + str(row['HR']) + ':00:00'
            sitename = str(row['ENODEB'])+'_'+row['SITE']
            if row['RTP_Session_Duration_Total_sec'] == 0:
                rtp_gap = 0
            else:
                rtp_gap = row['RTP_Gap_Length_Total_sec']/row['RTP_Session_Duration_Total_sec']
            time1 = pendulum.parse(stamp,tz='America/Chicago').isoformat()
            df1.loc[index,'DAY'] = time1
            df1.loc[index,'SITE'] = sitename
            df1.loc[index,'HR'] = rtp_gap
    
    ### Write DF to CSV ###
    df1.to_csv('RTP_json.csv',index=None)
    json_file_ind = 'RTP_json.json'
    file = open(json_file_ind, 'w')
    file.write("")
    file.close()
    
    #### Write CSV to JSON ###
    with open('RTP_json.csv', 'r') as csvfile:
        reader_ind = csv.DictReader(csvfile)
        row=[]
        for row in reader_ind:         
            row["RTP_Gap_Length_Total_sec"] = float(row["RTP_Gap_Length_Total_sec"])
            row["RTP_Session_Duration_Total_sec"] = float(row["RTP_Session_Duration_Total_sec"])
            row["RTP_Gap_Duration_Ratio_Avg%"] = float(row["RTP_Gap_Duration_Ratio_Avg%"])
            row["HR"] = float(row["HR"])
            with open('RTP_json.json', 'a') as json_file_ind:
                json.dump(row, json_file_ind)
                json_file_ind.write('\n')
    
    end_time = time.time()
    print("--- %s seconds ---" % (end_time - start_time))
    

    Output

        --- 2018-01-23T12:25:07.411691-06:00 seconds ---### START TIME
        (2055, 36) ### SIZE of initial DF
        (2055, 7) ### Size of Filtered DF
        --- 2018-01-23T12:31:54.480568-06:00 seconds --- --- ### END TIME
    

1 Answer:

Answer 0 (score: 0)

The following snippet can significantly speed up your DataFrame computation:

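    import numpy as np
    import pendulum

    df1 = df11[['ENODEB','DAY','HR','SITE','RTP_Gap_Length_Total_sec','RTP_Session_Duration_Total_sec','RTP_Gap_Duration_Ratio_Avg%']]

    print(df1.shape) ## Prints Shape

    # Drop the 'Total' rows in one step; drop=True keeps the old index out of the columns
    df1 = df1[df1['DAY'] != 'Total'].reset_index(drop=True)
    # pendulum.parse takes a single string, so build the strings column-wise and map over them
    stamps = df1['DAY'] + ' ' + df1['HR'].astype(str) + ':00:00'
    df1['DAY'] = stamps.map(lambda s: pendulum.parse(s, tz='America/Chicago').isoformat())
    # astype(str) converts element-wise; str(df1['ENODEB']) would stringify the whole Series
    df1['SITE'] = df1['ENODEB'].astype(str) + '_' + df1['SITE']
    df1['HR'] = np.where(df1['RTP_Session_Duration_Total_sec']==0, 0, df1['RTP_Gap_Length_Total_sec']/df1['RTP_Session_Duration_Total_sec'])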
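
The column-wise idea can be tried on a small frame without the original CSV. The following is only a sketch under assumptions: the sample values are invented, the `DAY` column is assumed to hold ISO dates, and pandas' own `to_datetime` and `tz_localize` are swapped in for pendulum so that the timestamp step is vectorized too. The ratio uses `Series.where` plus `fillna` as one possible alternative to `np.where`.

```python
import pandas as pd

# Invented sample rows; the column names follow the question.
df = pd.DataFrame({
    'ENODEB': [101, 101, 102],
    'DAY': ['2018-01-21', 'Total', '2018-01-21'],
    'HR': [5, 0, 13],
    'SITE': ['A', 'A', 'B'],
    'RTP_Gap_Length_Total_sec': [10.0, 10.0, 4.0],
    'RTP_Session_Duration_Total_sec': [40.0, 40.0, 0.0],
})

# Drop the summary rows with a boolean mask instead of df.drop() inside a loop.
df = df[df['DAY'] != 'Total'].reset_index(drop=True)

# Build the timestamps for the whole column at once (assumes ISO dates in DAY).
stamps = pd.to_datetime(df['DAY']) + pd.to_timedelta(df['HR'], unit='h')
df['DAY'] = stamps.dt.tz_localize('America/Chicago').map(lambda t: t.isoformat())

# astype(str) converts element-wise; str(series) would stringify the whole Series.
df['SITE'] = df['ENODEB'].astype(str) + '_' + df['SITE']

# Mask zero durations to NaN before dividing, then fill those rows with 0.
duration = df['RTP_Session_Duration_Total_sec']
ratio = df['RTP_Gap_Length_Total_sec'] / duration.where(duration != 0)
df['HR'] = ratio.fillna(0.0)

print(df[['DAY', 'SITE', 'HR']])
```

Each statement here operates on a whole column, so the per-row `df.loc[...]` assignments and `df.drop(...)` calls of the loop version are not needed.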

Also, why write the CSV and then read it back in?

Convert the DataFrame to JSON directly:

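    # to_json returns a single string when no path is given, so looping over it
    # would write one character per line. Passing a path with lines=True writes
    # one JSON object per line, like the original loop did.
    json_file_ind = 'RTP_json.json'
    df1.to_json(json_file_ind, orient='records', lines=True)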
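
When no path is passed, `to_json` returns the serialized text as one string, and iterating over a string yields single characters rather than records. A small sketch of the line-delimited form, using invented sample data and checking the result with the standard `json` module:

```python
import json
import pandas as pd

# Invented sample data, only to show the shape of the output.
df = pd.DataFrame({'SITE': ['101_A', '102_B'], 'HR': [0.25, 0.0]})

# With no path argument, to_json returns the serialized text as one string.
text = df.to_json(orient='records', lines=True)

# Each non-empty line is a standalone JSON object.
records = [json.loads(line) for line in text.splitlines() if line]
print(records)
```

Passing a file name as the first argument writes the same text to disk instead of returning it.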

This will speed up your code significantly.
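
To confirm which stage dominates the run time, each stage can be timed separately. The sketch below uses only the standard library. The `timed` helper and the `timings` dict are hypothetical names introduced for illustration, and the two workloads are placeholders for the real read, transform, and export steps.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start

# Placeholder workloads; in the real script these blocks would wrap
# read_csv, the transformation step, and the JSON export.
with timed('transform'):
    total = sum(i * i for i in range(100000))

with timed('export'):
    lines = [str(i) for i in range(1000)]

for label, seconds in timings.items():
    print("--- %s: %.4f seconds ---" % (label, seconds))
```

Comparing the per-stage numbers before and after a change shows whether the row loop or the file handling was the real bottleneck.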