Crawling data faster in Python

Time: 2018-11-07 14:02:10

Tags: python multithreading

I am crawling data from 25 GB of bz2 files. Right now I work through the compressed files one at a time: open each one, extract the sensor data, compute the median, and after all files are processed, write the results to an Excel file. Processing all of them takes an entire day, which is unacceptable.
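A minimal sketch of the per-file step, assuming (hypothetically) that each line of a file holds a timestamp and one reading separated by whitespace:

import bz2
import statistics

def median_from_bz2(path):
    # bz2.open with mode='rt' decompresses and decodes text on the fly
    with bz2.open(path, mode='rt') as f:
        # the column layout is an assumption; adjust the index to the real format
        readings = [float(line.split()[1]) for line in f]
    return statistics.median(readings)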

I want to make the process faster, so my idea is to fire off as many threads as possible, but how would I approach the problem? Pseudocode for the idea would be great.

The issue I have to consider is that the zip files carry a timestamp for each day. So, for example, my day1 has files at 20:00 that I need to process and save in a list, while other threads can process other days; but the writes to the output file on disk need to be synchronized so the data stays in order.
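One detail that helps here: multiprocessing.Pool.map returns results in the same order as its input, no matter which worker finishes first, so sorting the file list by timestamp up front is enough to keep the output chronological. A sketch under that assumption, reusing the hypothetical median_from_bz2 helper above:

from multiprocessing import Pool

if __name__ == '__main__':
    day_files = sorted(your_file_list)  # assumes the timestamp sorts lexically in the name
    with Pool() as pool:
        # map() blocks until every worker finishes and returns results
        # in input order, so one sequential write afterwards stays chronological
        medians = pool.map(median_from_bz2, day_files)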

Basically, I just want to speed it up.

Here is the pseudocode for processing the files, as suggested in the answer:

def proc_file(directory_names):
    try:
        # enumerate() provides the running index directly, so no separate counter is needed
        for i, name in enumerate(directory_names):
            print(name)
            process_data(name, i, directory_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

Sx through Tx are the sensor values.

1 Answer:

Answer 0 (score: 3)

With multiprocessing, your task looks straightforward.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()  # shared list that every worker process can append to

def proc_file(file):
    # process it: open the file, extract the sensor data, compute the median
    l.append(median)

if __name__ == '__main__':
    with Pool(4) as p:  # however many processes you want to spawn
        p.map(proc_file, your_file_list)

    # somehow save l to excel.
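To finish the "somehow save l to excel" step, one option (a sketch; the column name and output path are made up) is to copy the managed list into a plain list and hand it to pandas:

import pandas as pd

# list(l) copies the manager proxy into an ordinary Python list
df = pd.DataFrame({'median': list(l)})
df.to_excel(r'c:\ahmed\medians.xlsx', sheet_name='Sheet1')

Note that the list fills in completion order, not input order, which is exactly why the update below keys results by file name instead.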

Update: since you want to keep the file names (perhaps as a pandas column), here is how:

from multiprocessing import Pool, Manager

import pandas as pd

manager = Manager()
d = manager.dict()  # shared dict mapping file name -> result

def proc_file(file):
    # process it, then key the result by file name
    d[file] = median  # assuming file is given as a string; a list of values works as well

if __name__ == '__main__':
    with Pool(4) as p:  # however many processes you want to spawn
        p.map(proc_file, your_file_list)

    s = pd.Series(dict(d))  # copy the manager proxy into a plain dict first
    # if your 'median' is a list:
    # s = pd.DataFrame(dict(d)).T
    writer = pd.ExcelWriter(path)
    s.to_excel(writer, 'sheet1')
    writer.save()  # write the excel file to disk

If each file produces several values, you can build a dict in which each element is a list holding those values.
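For instance, if each file yields both a strain median and a temperature median (hypothetical names below), each worker can store a list, and the finished dict converts cleanly to a DataFrame with one row per file:

import pandas as pd

def proc_file(file):
    # ... compute both medians for this file ...
    d[file] = [strain_median, temp_median]

# after p.map() has finished:
df = pd.DataFrame.from_dict(dict(d), orient='index',
                            columns=['S_strain', 'S_temp'])
df.to_excel(r'c:\ahmed\medians_by_file.xlsx', sheet_name='Sheet1')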