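I imported several netCDF files with xarray and ultimately need to convert all of them into a single pandas DataFrame. The files contain weather data, and certain latitudes and longitudes have missing observations at every time step (because they lie in the middle of the ocean). Coordinates: lat, lon, time; variables: Temp, Pre. Before converting to a DataFrame, I would like to get rid of these missing observations, or the whole coordinates. Is there a simple and efficient way to do this with xarray? I could not find anything in the documentation.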
import pandas as pd
import xarray as xr
path = 'Z:/Research/Climate_change/Climate_extreme_index/CRU data/'
temp_data = path+'cru_ts4.01.1901.2016.tmp.dat.nc'
pre_data = path+'cru_ts4.01.1901.2016.pre.dat.nc'
# Open a netCDF file as an xarray Dataset
def open_netcdf(datapath):
    print("Loading data...")
    data = xr.open_dataset(datapath, autoclose=True, drop_variables='stn', cache=True)
    return data

# Merge the datasets
data_temp = open_netcdf(temp_data)
data_pre = open_netcdf(pre_data)
all_data = xr.merge([data_temp, data_pre])
#################################################################
<xarray.Dataset>
Dimensions: (lat: 360, lon: 720, time: 1392)
Coordinates:
* lon (lon) float32 -179.75 -179.25 -178.75 -178.25 -177.75 -177.25 ...
* lat (lat) float32 -89.75 -89.25 -88.75 -88.25 -87.75 -87.25 -86.75 ...
* time (time) datetime64[ns] 1901-01-16 1901-02-15 1901-03-16 ...
Data variables:
tmp (time, lat, lon) float32 ...
pre (time, lat, lon) float32 ...
#########################################################
# DataFrame example
                           tmp  pre
lat    lon     time
-89.75 -179.75 1901-01-16  NaN  NaN
               1901-02-15  NaN  NaN
               1901-03-16  NaN  NaN
               1901-04-16  NaN  NaN
               1901-05-16  NaN  NaN
               1901-06-16  NaN  NaN
               1901-07-16  NaN  NaN
               1901-08-16  NaN  NaN
               1901-09-16  NaN  NaN
               1901-10-16  NaN  NaN
               1901-11-16  NaN  NaN
Answer 0 (score: 0)
There is a dropna method, for example:
all_data.dropna('time', how='all')
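However, so far it is only implemented along one dimension at a time, so I am not sure it does what you need. As I understand it, you want to drop every lat/lon pair that is NaN at all times, right? I think you would have to stack lat and lon into a pandas MultiIndex coordinate and then use dropna along that new dimension.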
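The stacking approach suggested above might look like the following minimal sketch. The toy dataset, the ocean mask, and the dimension name `point` are all made up for illustration; only `Dataset.stack` and `Dataset.dropna` are the xarray calls being demonstrated.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: 4 time steps on a 3 x 6 lat/lon grid.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        name: (("time", "lat", "lon"), rng.random((4, 3, 6)))
        for name in ["tmp", "pre"]
    },
    coords={
        "time": pd.date_range("2010-01-01", periods=4, freq="D"),
        "lat": np.arange(-60, 90, 60),
        "lon": np.arange(-180, 180, 60),
    },
)

# Mark two grid cells as "ocean" and blank them out at every time step.
ocean_values = np.zeros((3, 6), dtype=bool)
ocean_values[0, 2] = True
ocean_values[2, 5] = True
ocean = xr.DataArray(ocean_values, dims=("lat", "lon"))
ds = ds.where(~ocean)

# Collapse lat and lon into one MultiIndex dimension (the name is arbitrary).
stacked = ds.stack(point=("lat", "lon"))

# Drop every lat/lon pair whose values are NaN in all variables at all times.
land_only = stacked.dropna("point", how="all")

print(stacked.sizes["point"], land_only.sizes["point"])
```

Here the stacked dataset has 18 points and 16 remain after `dropna`. Note that unstacking `point` again would bring the NaN cells back, since the regular lat/lon grid has to be filled in.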
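Answer 1 (score: 0)
The short answer is that converting the Dataset to a DataFrame before dropping the NaNs is exactly the right solution.
One of the main differences between a pandas DataFrame with a MultiIndex and an xarray Dataset is that individual index elements (time/lat/lon combinations) can be dropped from a MultiIndex without dropping every instance of that time, lat, or lon that has a NaN. A DataArray, on the other hand, models each dimension (time, lat, and lon) as orthogonal, which means NaNs cannot be dropped without dropping an entire slice of the array. This is a core feature of the xarray data model.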
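To make the orthogonality point concrete, here is a small sketch on a made-up array (the shape and values are assumptions chosen only so the counts are easy to follow). It shows that `dropna` along a single dimension either removes a whole slice, valid values included, or removes nothing.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 x 4 (time, lat, lon) array filled with valid values.
data = np.arange(24, dtype=float).reshape(2, 3, 4)
# One grid cell (lat index 1, lon index 2) is missing at every time step.
data[:, 1, 2] = np.nan
da = xr.DataArray(data, dims=("time", "lat", "lon"), name="tmp")

valid_before = int(da.notnull().sum())

# how='any' drops the entire longitude slice that contains the NaN cell,
# including the valid values at the other latitudes.
dropped_lon = da.dropna("lon", how="any")
valid_after = int(dropped_lon.notnull().sum())

# how='all' drops nothing, because that longitude still holds valid data.
kept = da.dropna("lon", how="all")

print(valid_before, valid_after, kept.sizes["lon"])
```

With `how='any'` the array goes from 22 valid values to 18, so four valid observations are lost along with the two NaNs; with `how='all'` all four longitudes (and both NaNs) remain.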
As an example, here is a small dataset that matches the structure of your data:
In [1]: import pandas as pd, numpy as np, xarray as xr
In [2]: ds = xr.Dataset({
   ...:     var: xr.DataArray(
   ...:         np.random.random((4, 3, 6)),
   ...:         dims=['time', 'lat', 'lon'],
   ...:         coords=[
   ...:             pd.date_range('2010-01-01', periods=4, freq='Q'),
   ...:             np.arange(-60, 90, 60),
   ...:             np.arange(-180, 180, 60)])
   ...:     for var in ['tmp', 'pre']})
   ...:
We can create a fake land mask that sets specific lat/lon combinations to NaN for all time periods:
In [3]: land_mask = (np.random.random((1, 3, 6)) > 0.3)
In [4]: ds = ds.where(land_mask)
In [5]: ds.tmp
Out[5]:
<xarray.DataArray 'tmp' (time: 4, lat: 3, lon: 6)>
array([[[0.020626, 0.937496, nan, 0.052608, 0.266924, 0.361297],
[0.299442, 0.524904, 0.447275, 0.277471, nan, 0.595671],
[0.541777, 0.279131, nan, 0.282487, nan, nan]],
[[0.473278, 0.302622, nan, 0.664146, 0.401243, 0.949998],
[0.225176, 0.601039, 0.543229, 0.144694, nan, 0.196285],
[0.059406, 0.37001 , nan, 0.867737, nan, nan]],
[[0.571011, 0.864374, nan, 0.123406, 0.663951, 0.684302],
[0.867234, 0.823417, 0.351692, 0.46665 , nan, 0.215644],
[0.425196, 0.777346, nan, 0.332028, nan, nan]],
[[0.916069, 0.54719 , nan, 0.11225 , 0.560431, 0.22632 ],
[0.605043, 0.991989, 0.880175, 0.3623 , nan, 0.629986],
[0.222462, 0.698494, nan, 0.56983 , nan, nan]]])
Coordinates:
* time (time) datetime64[ns] 2010-03-31 2010-06-30 2010-09-30 2010-12-31
* lat (lat) int64 -60 0 60
* lon (lon) int64 -180 -120 -60 0 60 120
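You can see that no lat or lon index can be dropped without losing valid data. When the data is converted to a DataFrame, on the other hand, the lat/lon/time dimensions are stacked into a single index, so individual elements of that index can be dropped without affecting any other rows: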
In [6]: ds.to_dataframe()
Out[6]:
                          tmp       pre
lat lon  time
-60 -180 2010-03-31  0.020626  0.605749
         2010-06-30  0.473278  0.192560
         2010-09-30  0.571011  0.850161
         2010-12-31  0.916069  0.415747
    -120 2010-03-31  0.937496  0.465283
         2010-06-30  0.302622  0.492205
         2010-09-30  0.864374  0.461739
         2010-12-31  0.547190  0.755914
    -60  2010-03-31       NaN       NaN
         2010-06-30       NaN       NaN
         2010-09-30       NaN       NaN
         2010-12-31       NaN       NaN
    0    2010-03-31  0.052608  0.529258
         2010-06-30  0.664146  0.116303
         2010-09-30  0.123406  0.389693
...                       ...       ...
60  120  2010-03-31       NaN       NaN
         2010-06-30       NaN       NaN
         2010-09-30       NaN       NaN
         2010-12-31       NaN       NaN
[72 rows x 2 columns]
Calling dropna() on this DataFrame removes only the all-NaN rows; no valid data is dropped:
In [7]: ds.to_dataframe().dropna(how='all')
Out[7]:
                          tmp       pre
lat lon  time
-60 -180 2010-03-31  0.020626  0.605749
         2010-06-30  0.473278  0.192560
         2010-09-30  0.571011  0.850161
         2010-12-31  0.916069  0.415747
    -120 2010-03-31  0.937496  0.465283
         2010-06-30  0.302622  0.492205
         2010-09-30  0.864374  0.461739
         2010-12-31  0.547190  0.755914
    0    2010-03-31  0.052608  0.529258
         2010-06-30  0.664146  0.116303
         2010-09-30  0.123406  0.389693
         2010-12-31  0.112250  0.485259
    60   2010-03-31  0.266924  0.795056
         2010-06-30  0.401243  0.299577
         2010-09-30  0.663951  0.359567
         2010-12-31  0.560431  0.933291
...                       ...       ...
60  0    2010-03-31  0.282487  0.148216
         2010-06-30  0.867737  0.643767
         2010-09-30  0.332028  0.471430
         2010-12-31  0.569830  0.380992
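Putting the steps together, the whole conversion could be wrapped in a small helper along these lines. This is only a sketch: the function name `dataset_to_clean_frame` and the toy dataset are hypothetical, and on a full-resolution grid the intermediate DataFrame may be large, so memory use is worth checking.

```python
import numpy as np
import pandas as pd
import xarray as xr


def dataset_to_clean_frame(ds):
    """Hypothetical helper: flatten a Dataset into a DataFrame and drop
    the rows in which every data variable is NaN."""
    return ds.to_dataframe().dropna(how="all")


# Toy stand-in for the merged dataset: two grid cells are NaN at all times.
values = np.ones((4, 3, 6))
values[:, 0, 2] = np.nan
values[:, 2, 5] = np.nan
toy = xr.Dataset(
    {
        "tmp": (("time", "lat", "lon"), values),
        "pre": (("time", "lat", "lon"), values * 2),
    },
    coords={
        "time": pd.date_range("2010-01-01", periods=4, freq="D"),
        "lat": np.arange(-60, 90, 60),
        "lon": np.arange(-180, 180, 60),
    },
)

frame = dataset_to_clean_frame(toy)
print(len(frame))

# With the question's merged dataset this would be:
# df = dataset_to_clean_frame(all_data)
```

The toy grid has 72 time/lat/lon combinations, of which 8 belong to the two masked cells, so 64 rows remain.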