From xarray

Time: 2017-04-27 18:14:54

Tags: python csv multidimensional-array python-xarray

I have some experimental data organized in folders:

/Condition1
    /run1.csv
    /run2.csv
    /run3.csv
    /run4.csv
/Condition2
    /run1.csv
    /run2.csv
    /run3.csv
    /run4.csv

Each run.csv contains the experimental conditions and some statistics, for example:

param1, param2,  stat1,  stat2
0,      0,       0.1,    0.2

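My code walks the directory tree and builds DataArrays by concatenating all the data within each subfolder, then concatenating the subfolders at the root folder level.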
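A minimal sketch of that traversal, using pandas only. The helper names `load_condition` and `load_tree` are hypothetical, and the sketch assumes the run files match `run*.csv` and that the blanks after each comma in the sample should be ignored. It builds a throwaway folder tree so it can run on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_condition(folder):
    """Concatenate every run*.csv in one condition folder (hypothetical helper)."""
    frames = []
    # Lexicographic sort: fine for run1..run4, but run10 would sort before run2
    for run_number, csv_path in enumerate(sorted(Path(folder).glob("run*.csv"))):
        # skipinitialspace drops the blanks that follow each comma in the sample file
        frame = pd.read_csv(csv_path, skipinitialspace=True)
        frame["run"] = run_number
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    combined["condition"] = Path(folder).name  # condition label taken from the folder name
    return combined


def load_tree(root):
    """Concatenate all condition subfolders found directly under root (hypothetical helper)."""
    condition_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    return pd.concat([load_condition(d) for d in condition_dirs], ignore_index=True)


# Build a small temporary tree that mirrors the layout above, then load it
with tempfile.TemporaryDirectory() as root:
    for condition in ("Condition1", "Condition2"):
        folder = Path(root) / condition
        folder.mkdir()
        for i in range(1, 5):
            (folder / f"run{i}.csv").write_text(
                "param1, param2,  stat1,  stat2\n0,      0,       0.1,    0.2\n"
            )
    records = load_tree(root)

print(records)
```

The result is one long table with a row per run, which can then be indexed or converted as needed.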

The code below applies the same logic, but uses hard-coded DataFrames instead of loading from CSV.

import pandas as pd
import xarray as xr

condition1 = []
# iterate through csv files in a folder and add to a list of dataframes
condition1.append(pd.DataFrame({'param1': [0], 'param2': [0], 'stat1': [0], 'stat2': [1]}))  # run1
condition1.append(pd.DataFrame({'param1': [0], 'param2': [1], 'stat1': [2], 'stat2': [3]}))  # run2
condition1.append(pd.DataFrame({'param1': [1], 'param2': [0], 'stat1': [4], 'stat2': [5]}))  # run3
condition1.append(pd.DataFrame({'param1': [1], 'param2': [1], 'stat1': [6], 'stat2': [7]}))  # run4

# do the same for experimental condition2
condition2 = []
condition2.append(pd.DataFrame({'param1': [0], 'param2': [0], 'stat1': [0], 'stat2': [1]}))  # run1
condition2.append(pd.DataFrame({'param1': [0], 'param2': [1], 'stat1': [2], 'stat2': [3]}))  # run2
condition2.append(pd.DataFrame({'param1': [1], 'param2': [0], 'stat1': [4], 'stat2': [5]}))  # run3
condition2.append(pd.DataFrame({'param1': [1], 'param2': [1], 'stat1': [6], 'stat2': [7]}))  # run4

# USING PANDAS
con1 = pd.concat(condition1)
con1['run'] = range(0, len(condition1))
con1['condition'] = "cond1"

con2 = pd.concat(condition2)
con2['run'] = range(0, len(condition2))
con2['condition'] = "cond2"

df = pd.concat([con1, con2])
df.index = range(0, len(df))
df.index.name = "record"
df = df.set_index(['param1', 'param2', 'condition', 'run'])

print(df, '\n')

print(xr.DataArray(df), '\n')

# USING XARRAY

# convert the list to xarray instead of dataframes
condition1_DA = [xr.DataArray(x) for x in condition1]
condition2_DA = [xr.DataArray(x) for x in condition2]  # convert to list of xarrays
# create 2d dataArrays with the 2nd dimension as the run number of the experiment
dataArrayCondition1 = xr.concat(condition1_DA, pd.Index(range(0, len(condition1)), name="runs"))
dataArrayCondition1.name = "condition1"  # usually this would be read from the folder name

dataArrayCondition2 = xr.concat(condition2_DA, pd.Index(range(0, len(condition2)), name="runs"))
dataArrayCondition2.name = "condition2"  # usually this would be read from the folder name

# create 3d data array that concatenates the two experimental conditions along the 3rd dimension
da = xr.concat([dataArrayCondition1, dataArrayCondition2], pd.Index(["cond1", "cond2"], name="conditions"))
da = da.rename({'dim_1': 'fields'})

print(da)

Output:

                         stat1  stat2
param1 param2 condition run              
0      0      cond1     0        0      1
       1      cond1     1        2      3
1      0      cond1     2        4      5
       1      cond1     3        6      7
0      0      cond2     0        0      1
       1      cond2     1        2      3
1      0      cond2     2        4      5
       1      cond2     3        6      7 

<xarray.DataArray (dim_0: 8, dim_1: 2)>
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
Coordinates:
  * dim_0      (dim_0) MultiIndex
  - param1     (dim_0) int64 0 0 1 1 0 0 1 1
  - param2     (dim_0) int64 0 1 0 1 0 1 0 1
  - condition  (dim_0) object 'cond1' 'cond1' 'cond1' 'cond1' 'cond2' ...
  - run        (dim_0) int64 0 1 2 3 0 1 2 3
  * dim_1      (dim_1) object 'stat1' 'stat2' 

<xarray.DataArray 'condition1' (conditions: 2, runs: 4, dim_0: 1, fields: 4)>
array([[[[0, 0, 0, 1]],

        [[0, 1, 2, 3]],

        [[1, 0, 4, 5]],

        [[1, 1, 6, 7]]],


       [[[0, 0, 0, 1]],

        [[0, 1, 2, 3]],

        [[1, 0, 4, 5]],

        [[1, 1, 6, 7]]]])
Coordinates:
  * dim_0       (dim_0) int64 0
  * fields      (fields) object 'param1' 'param2' 'stat1' 'stat2'
  * runs        (runs) int64 0 1 2 3
  * conditions  (conditions) object 'cond1' 'cond2'

Is there a way to extract param1 and param2 as separate dimensions? I tried using da.sel() and da.groupby(), but had no luck.

Ideally, the output would look like:

Coordinates:
  * dim_1       (dim_1) object 'stat1' 'stat2'
  * param1      (param1) int64 0 1
  * param2      (param2) int64 0 1
  * runs        (runs) int64 0 1 2 3
  * conditions  (conditions) object 'cond1' 'cond2'
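One direction that might get close to this layout, offered as a sketch rather than a confirmed answer: build the MultiIndexed DataFrame as in the pandas section, then let `DataFrame.to_xarray()` turn each index level into its own dimension and stack the stat columns with `Dataset.to_array()`. The names `indexed`, `ds`, `stats` and the dimension name `fields` are illustrative choices, and the dimensions come out as `condition` and `run` (the index level names) rather than `conditions` and `runs`.

```python
import pandas as pd
import xarray as xr

# Same hard-coded values as the example above: (param1, param2, stat1, stat2) per run
rows = [(0, 0, 0, 1), (0, 1, 2, 3), (1, 0, 4, 5), (1, 1, 6, 7)]

frames = []
for condition in ("cond1", "cond2"):
    frame = pd.DataFrame(rows, columns=["param1", "param2", "stat1", "stat2"])
    frame["run"] = range(len(frame))
    frame["condition"] = condition
    frames.append(frame)
df = pd.concat(frames, ignore_index=True)

# Each level of the MultiIndex becomes a dimension of the resulting Dataset
indexed = df.set_index(["param1", "param2", "condition", "run"])
ds = indexed.to_xarray()

# Stack the stat1/stat2 variables along a new dimension (named "fields" here by assumption)
stats = ds.to_array(dim="fields")

print(stats.coords)
```

Because every run in this example has exactly one (param1, param2) pair, most cells of the resulting array are NaN and the values are upcast to float. If run is really just a label for the parameter combination, leaving it out of the index would give a dense array.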

0 answers:

No answers