I am importing a CSV file into a pandas DataFrame. The CSV file looks like this:
Time, Status, Variable, freq_1, freq_2, freq_3, .....
1/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
1/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
1/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
1/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
2/1/2000, Hi, A, 0.1, 3.3, 8.1, ....
2/1/2000, Hi, B, 2.4, 1.2, 1.3, ....
2/1/2000, Lo, A, 4.5, 6.9, 6.4, ....
2/1/2000, Lo, B, 7.1, 8.8, 2.3, ....
....
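I read it into a DataFrame with a MultiIndex, using Time, Status, and Variable as the index levels.
I now want to convert the DataFrame to xarray using either pandas' to_xarray or xarray's from_dataframe. However, both methods seem to choke on the index and raise this error: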
TypeError: Could not convert tuple of form (dims, data[, attrs, encoding]): (0, DatetimeIndex(['2018-01-12 00:15:00', '2018-01-12 00:45:00',
'2018-01-12 01:15:00', '2018-01-12 01:45:00',
'2018-01-12 02:15:00', '2018-01-12 02:45:00',
'2018-01-12 03:15:00', '2018-01-12 03:45:00',
'2018-01-12 04:15:00', '2018-01-12 04:45:00',
...
'2019-11-01 16:15:00', '2019-11-01 17:15:00',
'2019-11-01 17:45:00', '2019-11-01 18:15:00',
'2019-11-01 18:45:00', '2019-11-01 19:15:00',
'2019-11-01 20:45:00', '2019-11-01 21:15:00',
'2019-11-01 21:45:00', '2019-11-01 22:15:00'],
dtype='datetime64[ns]', name=0, length=3100, freq=None)) to Variable.
I have also tried building it with the xarray.DataArray constructor:
Mytime = PD.to_datetime(df[0],infer_datetime_format=True)
Myfreq = np.arange(40)  # 0, 1, 2, ..., 39
OutDataArray = Xarray.DataArray(df[np.arange(3,43)], coords=[('time', Mytime ), ('freq', Myfreq ), ('Status', df[1]), ('variable', df[2])])
But this gives the error:
ValueError: coords is not dict-like, but it has 4 items, which does not match the 2 dimensions of the data
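So how do I import a pandas DataFrame into xarray when the DataFrame is 2D, but one of its dimensions (the rows) actually consists of several dimensions specified by the DataFrame's MultiIndex?
As requested, here is a sample script that reproduces the problem. Note that you need to set a file name for the CSV file of sample data that the script writes and then imports: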
import numpy as np
import pandas as PD
# create some data
dt = PD.date_range(start='01/01/2000 00:00:00', end='02/01/2000 00:00:00', freq='30min')
ldt = len(dt)
# one constant label column per fruit (np.str was removed from recent NumPy releases)
vr1 = PD.Series(['apple'] * ldt)
vr2 = PD.Series(['orange'] * ldt)
vr3 = PD.Series(['peach'] * ldt)
# combine the data to mimic my file format
a = PD.Series([1,2,3,4], index=[7,2,8,9])
b = PD.Series([5,6,7,8], index=[7,2,8,9])
df1 = PD.DataFrame({'Time': dt,'Fruit':vr1, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df2 = PD.DataFrame({'Time': dt,'Fruit':vr2, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df3 = PD.DataFrame({'Time': dt,'Fruit':vr3, 'N1': np.random.rand(ldt), 'N2': np.random.rand(ldt), 'N3': np.random.rand(ldt)})
df_unsorted = PD.concat([df1, df2, df3])
df = df_unsorted.sort_values(by=['Time', 'Fruit'])
# write the data to a csv file
filename = 'Give a file path/name here'
df.to_csv(filename, index=False)
#import the csv file and convert to an xarray
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
da = df2.to_xarray()
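Answer 0 (score: 1)
Your error seems to come from the columns and index levels of the CSV file not being named in the resulting DataFrame, since the header row is skipped on import. Replace the last two lines of your code example with: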
df2 = PD.read_csv(filename, sep=',', skiprows=1, header=None, skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
df2.columns = ['N1', 'N2', 'N3']
df2.index.names = ['time', 'fruit']
ds = df2.to_xarray()
This converts successfully to an xarray Dataset:
print(ds)
<xarray.Dataset>
Dimensions: (fruit: 3, time: 1489)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ... 2000-02-01
* fruit (fruit) object 'apple' 'orange' 'peach'
Data variables:
N1 (time, fruit) float64 0.114 0.3726 0.5072 ... 0.2065 0.9082 0.7945
N2 (time, fruit) float64 0.7534 0.1107 0.8866 ... 0.4509 0.5218 0.1472
N3 (time, fruit) float64 0.156 0.6498 0.3521 ... 0.3742 0.5899 0.607
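To see why the names matter, it can help to inspect what pandas assigns when the header row is skipped. The following is a minimal sketch, not the asker's actual file: the small in-memory CSV and the variable names `raw` and `named` are made up for illustration, and the time column is left as plain strings to keep the example short.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of the question's generated CSV.
csv_text = """Time,Fruit,N1,N2,N3
2000-01-01 00:00:00,apple,0.1,0.2,0.3
2000-01-01 00:00:00,orange,0.4,0.5,0.6
2000-01-01 00:30:00,apple,0.7,0.8,0.9
2000-01-01 00:30:00,orange,1.0,1.1,1.2
"""

# Same style of call as in the question: the header row is skipped,
# so pandas falls back to integer positions as labels.
raw = pd.read_csv(io.StringIO(csv_text), skiprows=1, header=None, index_col=[0, 1])
print(raw.index.names)    # integer labels for the two index levels
print(list(raw.columns))  # integer labels for the data columns

# Give every index level and column a string name before converting.
named = raw.copy()
named.columns = ["N1", "N2", "N3"]
named.index.names = ["time", "fruit"]

ds = named.to_xarray()
print(ds)
```

Each named index level becomes a dimension of the Dataset, and each named column becomes a data variable over those dimensions.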
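Update: You can skip setting the column and index names manually by removing the skiprows=1 and header=None arguments from PD.read_csv(), so that the column names are taken from the CSV header. Your last two lines then look like this: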
df2 = PD.read_csv(filename, sep=',', skipinitialspace=True, index_col=[0,1], parse_dates=True, infer_datetime_format=True )
ds = df2.to_xarray()
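The same idea should carry over to the three-level layout shown at the top of the question (Time, Status, Variable). The sketch below is only an illustration under stated assumptions: the CSV text is the excerpt from the question with the elided columns dropped, the dates are assumed to be day/month/year, and the dimension name `freq` passed to `to_array` is a choice made here rather than something fixed by the data.

```python
import io

import pandas as pd

# Excerpt of the question's CSV, truncated to the three visible freq columns.
csv_text = """Time, Status, Variable, freq_1, freq_2, freq_3
1/1/2000, Hi, A, 0.1, 3.3, 8.1
1/1/2000, Hi, B, 2.4, 1.2, 1.3
1/1/2000, Lo, A, 4.5, 6.9, 6.4
1/1/2000, Lo, B, 7.1, 8.8, 2.3
2/1/2000, Hi, A, 0.1, 3.3, 8.1
2/1/2000, Hi, B, 2.4, 1.2, 1.3
2/1/2000, Lo, A, 4.5, 6.9, 6.4
2/1/2000, Lo, B, 7.1, 8.8, 2.3
"""

# Keep the header so every column gets a string name.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Assumption: dates are day/month/year; adjust the format if they are not.
df["Time"] = pd.to_datetime(df["Time"], format="%d/%m/%Y")

# Each index level becomes one xarray dimension.
df = df.set_index(["Time", "Status", "Variable"])
ds = df.to_xarray()

# Optionally stack the freq_* data variables along a new "freq" dimension
# to get a single 4-D DataArray.
da = ds.to_array(dim="freq")
print(da.dims)
print(da.sel(freq="freq_2", Time=pd.Timestamp("2000-01-01"), Status="Lo", Variable="B").item())
```

Note that `to_xarray` builds the full product of the index levels, so any (Time, Status, Variable) combination missing from the file would show up as NaN in the result.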