Parallel processing to read a bunch of files and save them into one pandas.DataFrame

Time: 2016-10-14 09:32:10

Tags: python pandas multiprocessing netcdf

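Background

For some locations of interest, we want to extract useful information from open data sources. Take meteorological data as an example: we may only want to identify the long-term temporal pattern at a single point, while the data files often cover the whole world.

Here, I use Python to extract the vertical velocity at one point from the FNL (ds083.2) files, which are available every 6 hours, over one year.

In other words, I want to read the raw data and save the target variable along a timeline.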
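One way to get the timeline is to take the timestamp from each file name. The sketch below assumes every file follows the pattern of `fnl_20140101_06_00.grib2` and that the three numeric fields are date, hour and minute; `timestamp_from_name` is a hypothetical helper, not part of pygrib or pandas.

```python
import os
import re
from datetime import datetime

# Assumed layout: fnl_YYYYMMDD_HH_MM.grib2 (taken from the sample file name).
_STAMP = re.compile(r"fnl_(\d{8})_(\d{2})_(\d{2})")


def timestamp_from_name(path):
    """Parse the analysis time encoded in an FNL-style file name."""
    match = _STAMP.search(os.path.basename(path))
    if match is None:
        raise ValueError("unexpected file name: %r" % path)
    date, hour, minute = match.groups()
    return datetime.strptime(date + hour + minute, "%Y%m%d%H%M")
```

The resulting datetimes could then be passed as the `index` argument when the DataFrame is built, so that each extracted value is labelled with its time.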

My attempt

import numpy as np
from netCDF4 import Dataset
import pygrib
import pandas as pd
import os, time, datetime

# Return the element of `array` closest to `value`
def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return array[idx]

## Obtain the X, Y indices of the grid box containing the site
site_x, site_y = 116.4074, 39.9042  ## Longitude, latitude of the point of interest
grib = './fnl_20140101_06_00.grib2'  ## Any file will do for reading the lat-lon grid
grbs = pygrib.open(grib)
grb = grbs.select(name='Vertical velocity')[8]
lats, lons = grb.latlons()  ## Call latlons() once instead of twice
lon_list, lat_list = lons[0], lats.T[0]
x_indice = np.where(lon_list == find_nearest(lon_list, site_x))[0]
y_indice = np.where(lat_list == find_nearest(lat_list, site_y))[0]
grbs.close()

def extract_vm():
    files = os.listdir('.')  ### All files have already been saved in one directory
    files.sort()
    dict_vm = {"V": []}
    ### Traverse the files
    for file in files[1:]:
        if file.endswith("grib2"):
            grbs = pygrib.open(file)
            grb = grbs.select(name='Vertical velocity')[4]  ## Select a certain Z level
            data = grb.values
            data = data[y_indice, x_indice]
            dict_vm['V'].append(data)
            grbs.close()  ## Release the file handle before opening the next file

    ff = pd.DataFrame(dict_vm)
    return ff

extract_vm()
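The grid lookup in the code above goes through `np.where`, which returns index arrays rather than plain integers. A possibly simpler variant takes the position straight from `argmin`. This is only a sketch: `nearest_index` is a hypothetical helper, and the 1-degree axes are made up for illustration, not read from an FNL file.

```python
import numpy as np


def nearest_index(coords, value):
    """Return the integer position of the entry in `coords` closest to `value`."""
    coords = np.asarray(coords, dtype=float)
    return int(np.abs(coords - value).argmin())


# Illustrative 1-degree axes (assumed, not taken from the data files).
lon_axis = np.arange(0.0, 360.0, 1.0)    # 0, 1, ..., 359
lat_axis = np.arange(90.0, -91.0, -1.0)  # 90, 89, ..., -90

x = nearest_index(lon_axis, 116.4074)    # 116
y = nearest_index(lat_axis, 39.9042)     # 50, i.e. latitude 40
```

With scalar indices, `grb.values[y, x]` yields a single number instead of a one-element array.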

My thoughts

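How can I speed up the reading process? Right now I read the files sequentially, so the run time grows linearly with the length of the period being processed. Could the files be split into several groups and processed separately on a multi-core processor? Are there any other suggestions for making my code faster?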
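The idea of splitting the files across cores could look roughly like the sketch below, which uses the standard-library `multiprocessing.Pool`. The names `read_point` and `extract_parallel` and the layout of the task tuple are assumptions made for illustration. Whether this is actually faster depends on how much of the time goes to disk reads rather than GRIB decoding, so it would need to be timed.

```python
from multiprocessing import Pool


def read_point(task):
    """Worker: open one GRIB2 file and return (path, value at one grid cell).

    `task` is a (path, level_index, y, x) tuple so that it can be pickled
    and sent to a worker process. pygrib is imported inside the function
    so that every worker opens its own file handle.
    """
    import pygrib  # same library as in the question

    path, level_index, y, x = task
    grbs = pygrib.open(path)
    try:
        grb = grbs.select(name='Vertical velocity')[level_index]
        return path, float(grb.values[y, x])
    finally:
        grbs.close()


def extract_parallel(tasks, worker=read_point, processes=4):
    """Apply `worker` to every task and return the results in input order."""
    if processes <= 1:
        # Serial fallback, convenient for debugging a single file.
        return [worker(task) for task in tasks]
    with Pool(processes) as pool:
        # Pool.map preserves the order of `tasks` in its result list.
        return pool.map(worker, tasks)


# Usage sketch. Keep it under the __main__ guard so that worker processes
# can import this module safely:
#
# if __name__ == "__main__":
#     import glob
#     import pandas as pd
#     y, x = int(y_indice[0]), int(x_indice[0])
#     tasks = [(f, 4, y, x) for f in sorted(glob.glob("*.grib2"))]
#     df = pd.DataFrame(extract_parallel(tasks), columns=["file", "V"])
```

Each worker returns only one number per file, so very little data is passed back between processes; the list of results is turned into a DataFrame once, at the end.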

Any comments would be appreciated!

0 answers:

No answers yet.