Pandas: melt multiple columns that share the same index

Time: 2020-06-05 10:39:46

Tags: python pandas

I have the following pandas DataFrame:

+---+-----+-----+------+------+------+------+
|   |  A  |  B  | C_10 | C_20 | D_10 | D_20 |
+---+-----+-----+------+------+------+------+
| 1 | 0.1 | 0.2 |    1 |    2 |    3 |    4 |
| 2 | 0.3 | 0.4 |    5 |    6 |    7 |    8 |
+---+-----+-----+------+------+------+------+

Now I want to melt the columns C_10, C_20, D_10 and D_20 to get a DataFrame like this:

+---+-----+-----+----+---+---+
|   |  A  |  B  | N  | C | D |
+---+-----+-----+----+---+---+
| 1 | 0.1 | 0.2 | 10 | 1 | 3 |
| 1 | 0.1 | 0.2 | 20 | 2 | 4 |
| 2 | 0.3 | 0.4 | 10 | 5 | 7 |
| 2 | 0.3 | 0.4 | 20 | 6 | 8 |
+---+-----+-----+----+---+---+

Is there a simple way to do this? Thanks!

Edit: I tried wide_to_long, but it does not work when the DataFrame contains duplicate rows, that is, rows that share the same id values:

import pandas as pd

df = pd.DataFrame({
    'combination': [1, 1, 2, 2],
    'A': [0.1, 0.1, 0.2, 0.2],
    'B': [0.3, 0.3, 0.4, 0.4],
    'C_10': [1, 5, 6, 7],
    'C_20': [2, 6, 7, 8],
    'D_10': [3, 7, 8, 9],
    'D_20': [4, 8, 9, 10],
})
+--------------------------------------------------+
|    combination    A    B  C_10  C_20  D_10  D_20 |
+--------------------------------------------------+
| 0            1  0.1  0.3     1     2     3     4 |
| 1            1  0.1  0.3     5     6     7     8 |
| 2            2  0.2  0.4     6     7     8     9 |
| 3            2  0.2  0.4     7     8     9    10 |
+--------------------------------------------------+

If I use wide_to_long, I get the following error:

> pd.wide_to_long(df, stubnames=['C','D'], i=['combination', 'A', 'B'], j='N', sep='_').reset_index()


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-cc5863fa7ecc> in <module>
----> 1 pd.wide_to_long(df, stubnames=['C','D'], i=['combination', 'A', 'B'], j='N', sep='_').reset_index()

pandas/core/reshape/melt.py in wide_to_long(df, stubnames, i, j, sep, suffix)
    456 
    457     if df[i].duplicated().any():
--> 458         raise ValueError("the id variables need to uniquely identify each row")
    459 
    460     value_vars = [get_var_names(df, stub, sep, suffix) for stub in stubnames]

ValueError: the id variables need to uniquely identify each row

The parameter i is described as "Column(s) to use as id variable(s).", but I don't understand what that really means.

1 Answer:

Answer 0: (score: 2)

Use wide_to_long:

df = pd.wide_to_long(df, stubnames=['C','D'], i=['A','B'], j='N', sep='_').reset_index()
print(df)
     A    B   N  C  D
0  0.1  0.2  10  1  3
1  0.1  0.2  20  2  4
2  0.3  0.4  10  5  7
3  0.3  0.4  20  6  8
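
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the question's first frame by hand (the index labels 1 and 2 are read off the table in the question), and the name long_df is only illustrative.

```python
import pandas as pd

# The first frame from the question, rebuilt by hand (index labels 1 and 2).
df = pd.DataFrame(
    {
        "A": [0.1, 0.3],
        "B": [0.2, 0.4],
        "C_10": [1, 5],
        "C_20": [2, 6],
        "D_10": [3, 7],
        "D_20": [4, 8],
    },
    index=[1, 2],
)

# stubnames are the prefixes before sep; the part after sep becomes column N.
# Here each (A, B) pair occurs only once, so it can serve as the id.
long_df = pd.wide_to_long(
    df, stubnames=["C", "D"], i=["A", "B"], j="N", sep="_"
).reset_index()

print(long_df)
```

Note that the original index labels (1 and 2) are not kept here; the result gets a fresh 0-based index after reset_index().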

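Edit: If the combinations of the A, B columns are not unique, you can turn the index into a helper column named index with reset_index(), apply the same solution, and finally drop the index level: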

df = (pd.wide_to_long(df.reset_index(), 
                      stubnames=['C','D'],
                      i=['index','A','B'], 
                      j='N', 
                      sep='_')
        .reset_index(level=0, drop=True)
        .reset_index())
print(df)

     A    B   N  combination  C   D
0  0.1  0.3  10            1  1   3
1  0.1  0.3  20            1  2   4
2  0.1  0.3  10            1  5   7
3  0.1  0.3  20            1  6   8
4  0.2  0.4  10            2  6   8
5  0.2  0.4  20            2  7   9
6  0.2  0.4  10            2  7   9
7  0.2  0.4  20            2  8  10
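
A rough way to see why the helper column is needed: the traceback in the question shows that wide_to_long raises when df[i].duplicated().any() is true. The sketch below runs that same check on the question's second frame, with and without the index turned into a column. The names id_cols, has_dupes and helper_dupes are only illustrative.

```python
import pandas as pd

# The second frame from the question, where (combination, A, B) repeats.
df = pd.DataFrame({
    "combination": [1, 1, 2, 2],
    "A": [0.1, 0.1, 0.2, 0.2],
    "B": [0.3, 0.3, 0.4, 0.4],
    "C_10": [1, 5, 6, 7],
    "C_20": [2, 6, 7, 8],
    "D_10": [3, 7, 8, 9],
    "D_20": [4, 8, 9, 10],
})

id_cols = ["combination", "A", "B"]

# Same test the traceback shows inside wide_to_long: do any id rows repeat?
has_dupes = df[id_cols].duplicated().any()

# reset_index() adds a column named "index" holding the unique row labels,
# so including it among the id columns makes every id row distinct.
helper_dupes = df.reset_index()[["index"] + id_cols].duplicated().any()

print(has_dupes, helper_dupes)
```

In other words, the columns passed as i must together identify each row uniquely; the helper index column is one simple way to guarantee that.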