Question

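I am trying to load multiple high-resolution netCDF files into a single dataset using OpenDAP and xarray, then extract the time series at each lat/lon coordinate into individual CSV files. I am using xarray and dask.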

I can write a single CSV, but looping over the entire dataset would take more than 1200 hours.
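
For a sense of scale, here is a back-of-envelope sketch of what that estimate implies. The grid and time sizes come from the dataset summary further down. The count of 400,000 valid points is an assumption, rounded down from the "more than 400,000 grid points" mentioned later in this question, so the figures are rough, not measured.

```python
# Dimensions reported by ds.info() in this question.
n_lat, n_lon, n_time = 585, 1386, 29219

# Assumption: roughly 400,000 grid cells hold data (the rest are NaN).
valid_points = 400_000
loop_hours = 1200  # the estimate quoted above

total_cells = n_lat * n_lon
seconds_per_point = loop_hours * 3600 / valid_points
rows_if_flattened = valid_points * n_time  # rows in one giant DataFrame

print(f"grid cells in total:      {total_cells:,}")
print(f"seconds per point (loop): {seconds_per_point:.1f}")
print(f"rows if flattened at once: {rows_if_flattened:,}")
```

Under those assumptions the loop spends about 10.8 seconds per point, and flattening everything in one call would produce on the order of 11.7 billion rows.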

Is it better to select single points one at a time, or to select all points and then convert everything to CSV files at once?

import collections
import os

import xarray as xr


Variables = collections.namedtuple('Variables', ['new_name', 'conversion', 'units'])
toMJm2 = 3.6e-3
toDegC = 273.15
tomm = 1.0

VARS_DICT = {"rsds": Variables('SRAD', toMJm2, "MJ m**-2"),
             "tasmax": Variables("TMax", toDegC, "C"),
             "tasmin": Variables("TMin", toDegC, "C"),
             "pr": Variables("Rain", tomm, "mm")}


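def open_rename_ds(file):
    # Open lazily with dask; each chunk holds the full time series for a 5 x 10 block of cells.
    ds = xr.open_dataset(file, chunks={"lat": 5, "lon": 10, "time": -1})
    ds = ds.squeeze('crs').drop('crs')
    # Wrap longitudes into the [-180, 180) range.
    ds = ds.assign_coords(lon=(((ds.lon + 180) % 360) - 180))
    # The variable name is the third underscore-separated token of the file name.
    var_name = os.path.basename(file).split("_")[2]
    v = VARS_DICT[var_name]
    # Copy the names first so that adding variables inside the loop cannot disturb the iteration.
    for k in list(ds.data_vars):
        if v.conversion == 273.15:
            # Kelvin to degrees Celsius is an offset, not a scale factor.
            ds[v.new_name] = ds[k] - v.conversion
        else:
            ds[v.new_name] = ds[k] * v.conversion
        ds[v.new_name].attrs["units"] = v.units
    # Drop the original variable (assumes one data variable per file).
    _ds = ds.drop([k])
    return _ds.sel(time=slice("2020-01-01","2099-12-30"))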

fileslist = [
    'http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_macav2metdata_tasmax_bcc-csm1-1_r1i1p1_rcp85_2006_2099_CONUS_daily.nc',
    'http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_macav2metdata_tasmin_bcc-csm1-1_r1i1p1_rcp85_2006_2099_CONUS_daily.nc',
    'http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_macav2metdata_pr_bcc-csm1-1_r1i1p1_rcp85_2006_2099_CONUS_daily.nc',
    'http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_macav2metdata_rsds_bcc-csm1-1_r1i1p1_rcp85_2006_2099_CONUS_daily.nc',
]

dss = [open_rename_ds(file) for file in fileslist]
# Create a single dataset for the four opened datasets using `.update`
ds = dss[0]
for d in dss[1:]:
    ds.update(d)

# drop the attributes for brevity
ds.attrs = {}

Here is the size of the dataset and its summary:

print("ds size in GB {:0.2f}\n".format(ds.nbytes / 1e9))
ds.info()  # info() prints the summary itself and returns None

ds size in GB 379.06

xarray.Dataset {
dimensions:
    lat = 585 ;
    lon = 1386 ;
    time = 29219 ;

variables:
    float64 lat(lat) ;
        lat:long_name = latitude ;
        lat:standard_name = latitude ;
        lat:units = degrees_north ;
        lat:axis = Y ;
        lat:description = Latitude of the center of the grid cell ;
    float64 lon(lon) ;
    datetime64[ns] time(time) ;
        time:description = days since 1900-01-01 ;
    float32 TMax(time, lat, lon) ;
        TMax:units = C ;
    float32 TMin(time, lat, lon) ;
        TMin:units = C ;
    float32 Rain(time, lat, lon) ;
        Rain:units = mm ;
    float32 SRAD(time, lat, lon) ;
        SRAD:units = MJ m**-2 ;

// global attributes:
}

All lat/lon points that contain only NaN need to be removed, because this reduces the size of the output.
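
A minimal sketch of one way to find those valid cells, run on a tiny synthetic dataset instead of the real one. The name `demo` and its values are hypothetical, and the sketch assumes that a cell which is NaN on the first day is NaN on every day (a fixed land/ocean mask); that assumption should be checked against the real data. It also shows pointwise selection: passing two plain lists to `isel` selects the full lat x lon outer product, whereas `DataArray` indexers that share a dimension pick one cell per entry.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Tiny synthetic stand-in for the real dataset (hypothetical values).
rng = np.random.default_rng(0)
data = rng.random((3, 4, 5)).astype("float32")
data[:, 0, :] = np.nan  # a fully masked latitude row
data[:, 2, 3] = np.nan  # one isolated masked cell
demo = xr.Dataset(
    {"TMax": (("time", "lat", "lon"), data)},
    coords={
        "time": pd.date_range("2020-01-01", periods=3, freq="D"),
        "lat": np.linspace(40.0, 41.5, 4),
        "lon": np.linspace(-100.0, -98.0, 5),
    },
)

# Assumption: the NaN mask does not change over time, so one time step is enough.
valid = demo["TMax"].isel(time=0).notnull()

# Integer positions of every valid (lat, lon) pair.
lat_idx, lon_idx = np.nonzero(valid.values)

# Pointwise (vectorized) selection: both indexers share the new "point" dimension.
subset = demo.isel(
    lat=xr.DataArray(lat_idx, dims="point"),
    lon=xr.DataArray(lon_idx, dims="point"),
)
print(subset["TMax"].dims, subset["TMax"].shape)
```

On the real dataset the mask would only need one time step of one variable, which is far cheaper to read than the full series.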

I am trying something like this:

dd = ds.isel(lat=list_of_valid_lats, lon=list_of_valid_lons).to_dataframe()
dd.to_csv("testing_filename.csv.gz", index=False, compression="gzip")

This takes a very long time and often crashes my computer. I have also tried using to_dask_dataframe().

I have also tried:

df = ds.isel(lat=0, lon=1049).to_dataframe().reset_index()
df.to_csv("testing_filename.csv.gz", index=False, compression="gzip")

This works, but it is really slow with more than 400,000 grid points.
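
One direction that might reduce the per-point overhead is to read a whole spatial block at once and only then loop over its cells, so that each remote request serves many points. The sketch below illustrates the idea on a small synthetic dataset. `write_block`, `demo`, and the file naming scheme are hypothetical, and whether this is actually faster over OpenDAP is an assumption, not something measured here.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset standing in for the real one (hypothetical values).
shape = (4, 2, 3)
demo = xr.Dataset(
    {
        "TMax": (("time", "lat", "lon"), np.arange(24, dtype="float32").reshape(shape)),
        "Rain": (("time", "lat", "lon"), np.ones(shape, dtype="float32")),
    },
    coords={
        "time": pd.date_range("2020-01-01", periods=4, freq="D"),
        "lat": [40.0, 40.5],
        "lon": [-100.0, -99.5, -99.0],
    },
)


def write_block(ds, lat_slice, lon_slice, out_dir):
    """Load one spatial block into memory, then write one CSV per grid cell."""
    # A single .load() pulls the full time series for every cell in the block.
    block = ds.isel(lat=lat_slice, lon=lon_slice).load()
    written = []
    for i in range(block.sizes["lat"]):
        for j in range(block.sizes["lon"]):
            point = block.isel(lat=i, lon=j)
            # Skip cells with no data at all (for example, ocean cells).
            if bool(point["TMax"].isnull().all()):
                continue
            df = point.to_dataframe().reset_index()
            lat, lon = point["lat"].item(), point["lon"].item()
            path = out_dir / f"lat{lat:.3f}_lon{lon:.3f}.csv.gz"
            df.to_csv(path, index=False, compression="gzip")
            written.append(path)
    return written


out_dir = Path(tempfile.mkdtemp())
files = write_block(demo, slice(0, 2), slice(0, 3), out_dir)
print(len(files), "files written to", out_dir)
```

With the real data, the block size would presumably be matched to the dask chunking used in `open_rename_ds` (5 x 10 cells with the full time axis), and the blocks could be iterated over the whole grid.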

I also need to repeat this for about 40 similar files.

Any ideas or suggestions for improving performance would be appreciated.

Python: Extracting time-series data to CSV from high-resolution netCDF data using xarray

0 answers: