I have a pandas DataFrame of the following form:
           id2_cond1  id2_cond2  id2_cond3  id2_cond4
id2_cond1   1.000000   0.819689  -0.753702  -0.617213
id2_cond2   0.819689   1.000000  -0.554437  -0.295122
id2_cond3  -0.753702  -0.554437   1.000000   0.939336
id2_cond4  -0.617213  -0.295122   0.939336   1.000000
What I would like to do is convert the DataFrame into the following form:
     cond1_cond2  cond1_cond3  cond1_cond4  cond2_cond3  cond2_cond4  cond3_cond4
id2    0.8196886   -0.7537023   -0.6172134    -0.554437   -0.2951216    0.9393364
I can do this correctly with the following script:
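import pandas as pd

# `identifier` (e.g. "id2") and `cols` have to be defined beforehand.
df_tmp = pd.DataFrame(index=[identifier], columns=cols)
counter = 0
for x in range(len(df)):
    for y in range(x + 1, len(df)):
        # .ix was removed in pandas 1.0; .iat does positional scalar access.
        df_tmp.iat[0, counter] = df.iat[x, y]
        counter += 1
print(df_tmp)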
The problem with this approach is that I have to predefine the columns, and I have to know their order:
cols = ["cond1_cond2", "cond1_cond3", "cond1_cond4", "cond2_cond3", "cond2_cond4", "cond3_cond4"]
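Is there a better way to transform this DataFrame, one that generates the different combinations automatically?
Answer 0 (score: 1)
The original DataFrame: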
df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})
First, let's pull out the name ("id2" in this example):
name = df.index[0].split("_")[0]
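Then, let's get the name of each condition. I assume the names may themselves contain underscores (not the case in this example), so I split on underscores, drop the first element, and join the rest back together with underscores: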
conds = ["_".join(i.split("_")[1:]) for i in df.index]
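To make the underscore handling concrete, here is a minimal sketch. The labels below are hypothetical (they are not from the question) and assume that only the first underscore-separated token is the id. Splitting once with `str.split("_", 1)` should give the same result as splitting everywhere and re-joining.

```python
# Hypothetical labels; only the first underscore-separated token is the id.
labels = ["id2_cond1", "id2_cond_a", "id2_x_y_z"]

# Approach from the text: split on every underscore, drop the first piece, re-join.
conds = ["_".join(label.split("_")[1:]) for label in labels]

# Alternative: split only once, at the first underscore.
conds_alt = [label.split("_", 1)[1] for label in labels]

print(conds)
print(conds_alt)
```

One difference to keep in mind: for a label with no underscore at all, the first form yields an empty string, while the second raises an IndexError.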
Now, let's use a list comprehension to generate all the name combinations:
idx = ['{0}_{1}'.format(conds[i], conds[j])
       for i in range(len(conds))
       for j in range(i + 1, len(conds))]
We'll use the same technique to flatten the data:
data = [df.iat[i, j]
        for i in range(len(conds))
        for j in range(i + 1, len(conds))]
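The two nested comprehensions walk the same (i, j) pairs, so they could arguably share a single list of pairs. The sketch below uses `itertools.combinations` for that, which is a swapped-in technique rather than what this answer does. The 3x3 frame and its values are made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Small made-up symmetric matrix, labelled like the one in the question.
labels = ["id2_cond1", "id2_cond2", "id2_cond3"]
df = pd.DataFrame(
    [[1.0, 0.8, -0.7],
     [0.8, 1.0, -0.5],
     [-0.7, -0.5, 1.0]],
    index=labels,
    columns=labels,
)

conds = [label.split("_", 1)[1] for label in df.index]

# combinations(range(n), 2) yields (0, 1), (0, 2), (1, 2), ...
# which is the same order as the nested i < j loops.
pairs = list(combinations(range(len(conds)), 2))

idx = ["{0}_{1}".format(conds[i], conds[j]) for i, j in pairs]
data = [df.iat[i, j] for i, j in pairs]

flat = pd.Series(data, index=idx, name="id2")
print(flat)
```

Building `pairs` once keeps the names and the values aligned by construction, instead of relying on two comprehensions iterating in the same order.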
Finally, we create a Series from the information above:
corr_matrix_flat = pd.Series(data, index=idx, name=name)
>>> corr_matrix_flat
cond1_cond2 0.819689
cond1_cond3 -0.753702
cond1_cond4 -0.617213
cond2_cond3 -0.554437
cond2_cond4 -0.295122
cond3_cond4 0.939336
Name: id2, dtype: float64
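The question asked for a single-row DataFrame rather than a Series. One way to get there, sketched here as an assumption about what the asker wants, is to turn the named Series into a one-column frame and transpose it. The values are copied from the output above.

```python
import pandas as pd

# Flattened values as printed above, with the Series named after the id.
corr_matrix_flat = pd.Series(
    [0.819689, -0.753702, -0.617213, -0.554437, -0.295122, 0.939336],
    index=["cond1_cond2", "cond1_cond3", "cond1_cond4",
           "cond2_cond3", "cond2_cond4", "cond3_cond4"],
    name="id2",
)

# to_frame() uses the Series name as the column label;
# transposing turns that label into the single row index.
row = corr_matrix_flat.to_frame().T
print(row)
```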
Answer 1 (score: 0)
Here is another version that uses the pandas built-in function stack.
import pandas as pd
df = pd.DataFrame({'id2_cond1': {'id2_cond1': 1.0, 'id2_cond2': 0.81968899999999989, 'id2_cond3': -0.75370200000000009, 'id2_cond4': -0.61721300000000001},
'id2_cond2': {'id2_cond1': 0.81968899999999989, 'id2_cond2': 1.0, 'id2_cond3': -0.55443699999999996, 'id2_cond4': -0.295122},
'id2_cond3': {'id2_cond1': -0.75370200000000009, 'id2_cond2': -0.55443699999999996, 'id2_cond3': 1.0, 'id2_cond4': 0.93933600000000006},
'id2_cond4': {'id2_cond1': -0.61721300000000001, 'id2_cond2': -0.295122, 'id2_cond3': 0.93933600000000006, 'id2_cond4': 1.0}})
Convert df to a Series with df.stack():
s = df.stack()
print(s)
Output:
id2_cond1  id2_cond1    1.000000
           id2_cond2    0.819689
           id2_cond3   -0.753702
           id2_cond4   -0.617213
id2_cond2  id2_cond1    0.819689
           id2_cond2    1.000000
           id2_cond3   -0.554437
           id2_cond4   -0.295122
id2_cond3  id2_cond1   -0.753702
           id2_cond2   -0.554437
           id2_cond3    1.000000
           id2_cond4    0.939336
id2_cond4  id2_cond1   -0.617213
           id2_cond2   -0.295122
           id2_cond3    0.939336
           id2_cond4    1.000000
dtype: float64
Next, drop the diagonal and the lower-triangular part.
ind_upper = []
for i in range(len(df)):
    for j in range(len(df)):
        # Keep only the entries strictly above the diagonal.
        ind_upper.append(i < j)
s = s[ind_upper]
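The same upper-triangle mask can probably be built without explicit loops by using NumPy's `np.triu`. This is a sketch with made-up values, not part of the original answer. It assumes the matrix has no missing values, because `stack()` may drop NaN entries, and the mask would then no longer line up with the stacked Series.

```python
import numpy as np
import pandas as pd

# Small made-up symmetric matrix, labelled like the one in the question.
labels = ["id2_cond1", "id2_cond2", "id2_cond3"]
df = pd.DataFrame(
    [[1.0, 0.8, -0.7],
     [0.8, 1.0, -0.5],
     [-0.7, -0.5, 1.0]],
    index=labels,
    columns=labels,
)

s = df.stack()

# k=1 keeps only the entries strictly above the diagonal.
mask = np.triu(np.ones(df.shape, dtype=bool), k=1)

# stack() walks the frame row by row, which matches ravel() order.
upper = s[mask.ravel()]
print(upper)
```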
Next, merge the row and column labels into a single label.
index = list(s.index)
print(index)
[('id2_cond1', 'id2_cond2'), ('id2_cond1', 'id2_cond3'), ('id2_cond1', 'id2_cond4'), ('id2_cond2', 'id2_cond3'), ('id2_cond2', 'id2_cond4'), ('id2_cond3', 'id2_cond4')]
# Join each (row, column) pair, then strip the "id2_" prefix.
index = ['_'.join(pair) for pair in index]
index = [label.replace('id2_', '') for label in index]
print(index)
['cond1_cond2', 'cond1_cond3', 'cond1_cond4', 'cond2_cond3', 'cond2_cond4', 'cond3_cond4']
Assign index to s:
s.index = index
print(s)
cond1_cond2 0.819689
cond1_cond3 -0.753702
cond1_cond4 -0.617213
cond2_cond3 -0.554437
cond2_cond4 -0.295122
cond3_cond4 0.939336
dtype: float64