I have some experimental data organized in folders:
/Condition1
    /run1.csv
    /run2.csv
    /run3.csv
    /run4.csv
/Condition2
    /run1.csv
    /run2.csv
    /run3.csv
    /run4.csv
Each run.csv contains the experimental conditions and some statistics, for example:
param1, param2, stat1, stat2
0, 0, 0.1, 0.2
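My code walks the directory tree, builds DataArrays by concatenating all the data within each subfolder, and then concatenates the subfolders at the root level.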
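For illustration, the loading step could be sketched roughly as follows. The helper names `load_condition` and `load_tree` and the `run*.csv` glob pattern are assumptions made for this sketch, not the actual code; it only collects the per-run DataFrames and leaves the concatenation to the later steps.

```python
import glob
import os

import pandas as pd


def load_condition(folder):
    """Read every run*.csv in `folder` into a list of DataFrames (hypothetical helper)."""
    # Sort by file name so run1, run2, ... come back in a stable order.
    paths = sorted(glob.glob(os.path.join(folder, "run*.csv")))
    # skipinitialspace drops the blank after each comma in "param1, param2, ...".
    return [pd.read_csv(path, skipinitialspace=True) for path in paths]


def load_tree(root):
    """Map each condition folder name under `root` to its list of per-run DataFrames."""
    return {
        name: load_condition(os.path.join(root, name))
        for name in sorted(os.listdir(root))
        if os.path.isdir(os.path.join(root, name))
    }
```

Calling `load_tree` on the root folder would give a dict such as `{"Condition1": [...], "Condition2": [...]}`, which plays the role of the `condition1` and `condition2` lists in the hard-coded version.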
The code below implements the same logic, but uses hard-coded DataFrames instead of loading from CSV.
import pandas as pd
import xarray as xr
condition1 = []
# iterate through csv files in a folder and add to a list of dataframes
condition1.append(pd.DataFrame({'param1': [0], 'param2': [0], 'stat1': [0], 'stat2': [1]})) # run1
condition1.append(pd.DataFrame({'param1': [0], 'param2': [1], 'stat1': [2], 'stat2': [3]})) # run2
condition1.append(pd.DataFrame({'param1': [1], 'param2': [0], 'stat1': [4], 'stat2': [5]})) # run3
condition1.append(pd.DataFrame({'param1': [1], 'param2': [1], 'stat1': [6], 'stat2': [7]})) # run4
# do the same for experimental condition2
condition2 = []
condition2.append(pd.DataFrame({'param1': [0], 'param2': [0], 'stat1': [0], 'stat2': [1]})) # run1
condition2.append(pd.DataFrame({'param1': [0], 'param2': [1], 'stat1': [2], 'stat2': [3]})) # run2
condition2.append(pd.DataFrame({'param1': [1], 'param2': [0], 'stat1': [4], 'stat2': [5]})) # run3
condition2.append(pd.DataFrame({'param1': [1], 'param2': [1], 'stat1': [6], 'stat2': [7]})) # run4
# USING PANDAS
con1 = pd.concat(condition1)
con1['run'] = range(0, len(condition1))
con1['condition'] = "cond1"
con2 = pd.concat(condition2)
con2['run'] = range(0, len(condition2))
con2['condition'] = "cond2"
df = pd.concat([con1, con2])
df.index = range(0, len(df))
df.index.name = "record"
df = df.set_index(['param1', 'param2', 'condition', 'run'])
print(df, '\n')
print(xr.DataArray(df), '\n')
# USING XARRAY
# convert the list to xarray instead of dataframes
condition1_DA = [xr.DataArray(x) for x in condition1]
condition2_DA = [xr.DataArray(x) for x in condition2] # convert to list of xarrays
# stack each condition's runs along a new "runs" dimension (dims: runs, dim_0, dim_1)
dataArrayCondition1 = xr.concat(condition1_DA, pd.Index(range(0, len(condition1)), name="runs"))
dataArrayCondition1.name = "condition1" # usually this would be read from the folder name
dataArrayCondition2 = xr.concat(condition2_DA, pd.Index(range(0, len(condition2)), name="runs"))
dataArrayCondition2.name = "condition2" # usually this would be read from the folder name
# concatenate the two experimental conditions along a new "conditions" dimension (4 dimensions in total)
da = xr.concat([dataArrayCondition1, dataArrayCondition2], pd.Index(["cond1", "cond2"], name="conditions"))
da = da.rename({'dim_1': 'fields'})
print(da)
Output:
                             stat1  stat2
param1 param2 condition run
0      0      cond1     0        0      1
       1      cond1     1        2      3
1      0      cond1     2        4      5
       1      cond1     3        6      7
0      0      cond2     0        0      1
       1      cond2     1        2      3
1      0      cond2     2        4      5
       1      cond2     3        6      7
<xarray.DataArray (dim_0: 8, dim_1: 2)>
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
Coordinates:
  * dim_0      (dim_0) MultiIndex
  - param1     (dim_0) int64 0 0 1 1 0 0 1 1
  - param2     (dim_0) int64 0 1 0 1 0 1 0 1
  - condition  (dim_0) object 'cond1' 'cond1' 'cond1' 'cond1' 'cond2' ...
  - run        (dim_0) int64 0 1 2 3 0 1 2 3
  * dim_1      (dim_1) object 'stat1' 'stat2'
<xarray.DataArray 'condition1' (conditions: 2, runs: 4, dim_0: 1, fields: 4)>
array([[[[0, 0, 0, 1]],
        [[0, 1, 2, 3]],
        [[1, 0, 4, 5]],
        [[1, 1, 6, 7]]],
       [[[0, 0, 0, 1]],
        [[0, 1, 2, 3]],
        [[1, 0, 4, 5]],
        [[1, 1, 6, 7]]]])
Coordinates:
  * dim_0       (dim_0) int64 0
  * fields      (fields) object 'param1' 'param2' 'stat1' 'stat2'
  * runs        (runs) int64 0 1 2 3
  * conditions  (conditions) object 'cond1' 'cond2'
Is there a way to pull param1 and param2 out as separate dimensions? I tried da.sel() and da.groupby(), but had no luck.
Ideally, the output would look like:
Coordinates:
  * dim_1       (dim_1) object 'stat1' 'stat2'
  * param1      (param1) int64 0 1
  * param2      (param2) int64 0 1
  * runs        (runs) int64 0 1 2 3
  * conditions  (conditions) object 'cond1' 'cond2'
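One direction that might get close to this layout, sketched under assumptions rather than offered as a confirmed solution: put param1, param2, the run number and the condition into a pandas MultiIndex, then convert with `DataFrame.to_xarray()` and `Dataset.to_array()`. The sketch assumes each (param1, param2, runs, conditions) combination identifies at most one row. The level names `runs` and `conditions` and the stacking dimension `dim_1` were chosen only to mirror the desired output.

```python
import pandas as pd

# Same records as the hard-coded example: one row per (condition, run).
runs = [(0, 0, 0, 1), (0, 1, 2, 3), (1, 0, 4, 5), (1, 1, 6, 7)]
rows = []
for cond in ["cond1", "cond2"]:
    for run, (p1, p2, s1, s2) in enumerate(runs):
        rows.append({"param1": p1, "param2": p2, "runs": run,
                     "conditions": cond, "stat1": s1, "stat2": s2})

df = pd.DataFrame(rows).set_index(["param1", "param2", "runs", "conditions"])

# Each index level becomes its own dimension; stat1/stat2 become data variables.
ds = df.to_xarray()

# Stack the data variables along a new dimension named after the desired output.
da = ds.to_array(dim="dim_1")
print(da.dims)
```

With this example data every (param1, param2) pair occurs in only one run, so the resulting array is mostly NaN and the integer statistics are converted to floats. Whether that sparsity is acceptable depends on how the real runs map onto the parameter grid.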