Improving a pandas DataFrame transformation in Python

Time: 2015-06-11 13:07:14

Tags: python pandas

I have a pandas DataFrame of the following form:

            id2_cond1  id2_cond2  id2_cond3  id2_cond4
id2_cond1   1.000000   0.819689  -0.753702  -0.617213
id2_cond2   0.819689   1.000000  -0.554437  -0.295122
id2_cond3  -0.753702  -0.554437   1.000000   0.939336
id2_cond4  -0.617213  -0.295122   0.939336   1.000000

What I want to do is convert the DataFrame into the following form:

      cond1_cond2 cond1_cond3 cond1_cond4 cond2_cond3 cond2_cond4 cond3_cond4
id2    0.8196886  -0.7537023  -0.6172134   -0.554437  -0.2951216   0.9393364

I can do this correctly with the following script:

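# identifier and cols must be defined beforehand (cols is listed below)
df_tmp = pd.DataFrame(index=[identifier], columns=cols)
counter = 0
for x in range(len(df)):
    for y in range(x + 1, len(df)):
        # .ix was removed in pandas 1.0; .iloc is the positional equivalent
        df_tmp.iloc[0, counter] = df.iloc[x, y]
        counter += 1
print(df_tmp)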

The problem with this approach is that I have to predefine the columns and know their order in advance:

cols = ["cond1_cond2", "cond1_cond3", "cond1_cond4", "cond2_cond3", "cond2_cond4", "cond3_cond4"]

Is there a better way to transform this DataFrame that creates the different combinations automatically?

2 Answers:

Answer 0: (score: 1)

The original DataFrame:

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

First, let's extract the identifier name ("id2" in this example):

name = df.index[0].split("_")[0]

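Then let's get the name of each condition. I assume a name may itself contain underscores (not the case in this example), so I split on underscores, drop the first element, and rejoin the rest with underscores: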

conds = ["_".join(i.split("_")[1:]) for i in df.index]
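To illustrate why the split-and-rejoin step matters, here is a minimal sketch with hypothetical labels (`id2_cond_a`, `id2_cond_b`, not from the question) whose condition part contains an underscore. It also shows that `str.split` with `maxsplit=1` should give the same result in one step.

```python
# Hypothetical labels whose condition part itself contains an underscore.
labels = ["id2_cond_a", "id2_cond_b"]

# Split on "_", drop the first piece (the identifier), rejoin the rest.
conds = ["_".join(label.split("_")[1:]) for label in labels]

# Splitting at most once on "_" keeps the remainder intact.
conds_alt = [label.split("_", 1)[1] for label in labels]

print(conds)
print(conds_alt)
```

Both variants assume the identifier itself contains no underscore.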

Now let's generate all name combinations with a list comprehension:

idx = ['{0}_{1}'.format(conds[i], conds[j]) 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]

We'll use the same technique to flatten the data:

data = [df.iat[i, j] 
        for i in range(len(conds)) 
        for j in range(i + 1, len(conds))]
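The two nested comprehensions above walk the same index pairs, so one possible variation is to generate the pairs once with `itertools.combinations` and reuse them. This is a sketch rather than part of the original answer; the DataFrame is rebuilt here with the rounded values from the question so that the block runs on its own.

```python
from itertools import combinations

import pandas as pd

conds = ["cond1", "cond2", "cond3", "cond4"]
labels = ["id2_" + c for c in conds]
values = [[1.0, 0.819689, -0.753702, -0.617213],
          [0.819689, 1.0, -0.554437, -0.295122],
          [-0.753702, -0.554437, 1.0, 0.939336],
          [-0.617213, -0.295122, 0.939336, 1.0]]
df = pd.DataFrame(values, index=labels, columns=labels)

# All (i, j) position pairs with i < j, in the same order as the nested loops.
pairs = list(combinations(range(len(conds)), 2))

idx = ["{0}_{1}".format(conds[i], conds[j]) for i, j in pairs]
data = [df.iat[i, j] for i, j in pairs]

print(idx)
print(data)
```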

Finally, we create a Series from the information above:

>>> corr_matrix_flat = pd.Series(data, index=idx, name=name)
>>> corr_matrix_flat
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
Name: id2, dtype: float64
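The question asks for a one-row DataFrame rather than a Series. A minimal sketch of that last step, assuming the Series is named after the identifier as above, is to call `to_frame()` and transpose. The values are the rounded ones from the question.

```python
import pandas as pd

idx = ["cond1_cond2", "cond1_cond3", "cond1_cond4",
       "cond2_cond3", "cond2_cond4", "cond3_cond4"]
data = [0.819689, -0.753702, -0.617213, -0.554437, -0.295122, 0.939336]
corr_matrix_flat = pd.Series(data, index=idx, name="id2")

# to_frame() gives one column named "id2"; transposing turns it into one row.
row = corr_matrix_flat.to_frame().T
print(row)
```

Rows built this way for several identifiers could then be stacked with `pd.concat`, since they share the same column labels.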

Answer 1: (score: 0)

Here is another version that uses the pandas built-in function stack.

import pandas as pd

df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
                   'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
                   'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
                   'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})

df

Convert it to a Series with df.stack():

s = df.stack()
print(s)

Output:

id2_cond1  id2_cond1    1.000000
           id2_cond2    0.819689
           id2_cond3   -0.753702
           id2_cond4   -0.617213
id2_cond2  id2_cond1    0.819689
           id2_cond2    1.000000
           id2_cond3   -0.554437
           id2_cond4   -0.295122
id2_cond3  id2_cond1   -0.753702
           id2_cond2   -0.554437
           id2_cond3    1.000000
           id2_cond4    0.939336
id2_cond4  id2_cond1   -0.617213
           id2_cond2   -0.295122
           id2_cond3    0.939336
           id2_cond4    1.000000
dtype: float64

Next, remove the diagonal and the lower triangle.

ind_upper = []
for i in range(len(df)):
    for j in range(len(df)):
        # keep only the entries strictly above the diagonal
        if i < j:
            ind_upper.append(True)
        else:
            ind_upper.append(False)

s = s[ind_upper]
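The same boolean selector can probably be built without explicit loops. The sketch below is an alternative, not the answer's own code: it uses `numpy.triu` with `k=1` to mark the strict upper triangle and flattens it in row-major order, which matches the order of `df.stack()` when the frame has no missing values. The DataFrame uses the rounded values from the question.

```python
import numpy as np
import pandas as pd

labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [[1.0, 0.819689, -0.753702, -0.617213],
          [0.819689, 1.0, -0.554437, -0.295122],
          [-0.753702, -0.554437, 1.0, 0.939336],
          [-0.617213, -0.295122, 0.939336, 1.0]]
df = pd.DataFrame(values, index=labels, columns=labels)

n = len(df)
# True strictly above the diagonal (k=1 excludes the diagonal itself).
mask = np.triu(np.ones((n, n), dtype=bool), k=1)

s = df.stack()
# ravel() flattens row by row, the same order in which stack() lists entries.
s_upper = s[mask.ravel()]
print(s_upper)
```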

Next, merge the row and column labels into a single label.

index = list(s.index)
print(index)
[('id2_cond1', 'id2_cond2'), ('id2_cond1', 'id2_cond3'), ('id2_cond1', 'id2_cond4'), ('id2_cond2', 'id2_cond3'), ('id2_cond2', 'id2_cond4'), ('id2_cond3', 'id2_cond4')]

index = ['_'.join(pair) for pair in index]
index = [label.replace('id2_', '') for label in index]
print(index)
['cond1_cond2', 'cond1_cond3', 'cond1_cond4', 'cond2_cond3', 'cond2_cond4', 'cond3_cond4']

Assign index to s:

s.index = index
print(s)
cond1_cond2    0.819689
cond1_cond3   -0.753702
cond1_cond4   -0.617213
cond2_cond3   -0.554437
cond2_cond4   -0.295122
cond3_cond4    0.939336
dtype: float64
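The steps in this thread hard-code the "id2" prefix in places. As a rough sketch of how they might be folded into one reusable helper, the function below (`flatten_upper_triangle` is a hypothetical name, not a pandas function) derives the identifier and condition names from the labels and picks the upper triangle with `numpy.triu_indices`. It assumes a square frame whose labels look like `<identifier>_<condition>` with no underscore in the identifier.

```python
import numpy as np
import pandas as pd


def flatten_upper_triangle(corr):
    """Return the strict upper triangle of a square labelled frame as a Series."""
    # Assumption: every label has the form "<identifier>_<condition>".
    name = corr.index[0].split("_", 1)[0]
    conds = [label.split("_", 1)[1] for label in corr.index]
    # Row and column positions of the entries above the diagonal, row by row.
    rows, cols = np.triu_indices(len(corr), k=1)
    index = ["{0}_{1}".format(conds[i], conds[j]) for i, j in zip(rows, cols)]
    return pd.Series(corr.to_numpy()[rows, cols], index=index, name=name)


labels = ["id2_cond1", "id2_cond2", "id2_cond3", "id2_cond4"]
values = [[1.0, 0.819689, -0.753702, -0.617213],
          [0.819689, 1.0, -0.554437, -0.295122],
          [-0.753702, -0.554437, 1.0, 0.939336],
          [-0.617213, -0.295122, 0.939336, 1.0]]
df = pd.DataFrame(values, index=labels, columns=labels)

flat = flatten_upper_triangle(df)
print(flat)
```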