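In a pandas DataFrame, how can "flattened" variables be "unflattened" using an index into new columns?

Time: 2017-11-30 18:09:24

Tags: python pandas dataframe flatten

I have some data in pandas DataFrames that is read from ROOT files using root_pandas. Most of the data are simple variables that can take various values. However, some variables are arrays of numbers. To load these arrays, there is an option to flatten the variables.

So, for example, the array variable jet_tagWeightBin can have a varying number of values, depending on the number of jets in the physics event. When "flattened", the value for each jet in a given physics event can be accessed using the index __array_index.

Here is what loading three physics events looks like. You can see that each physics event has a single HT_jets value, but several jet_tagWeightBin values that can be accessed by their index: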

|  |HT_jets|jet_tagWeightBin|__array_index|             |
|--|-------|----------------|-------------|-------------|
|0 |319676 |1               |0            |<-- 1st event|
|1 |319676 |5               |1            |             |
|2 |319676 |1               |2            |             |
|3 |319676 |5               |3            |             |
|4 |200476 |5               |0            |<-- 2nd event|
|5 |200476 |2               |1            |             |
|6 |200476 |1               |2            |             |
|7 |200476 |1               |3            |             |
|8 |520111 |5               |0            |<-- 3rd event|
|9 |520111 |1               |1            |             |
|10|520111 |2               |2            |             |
|11|520111 |5               |3            |             |
|12|520111 |5               |4            |             |
|13|520111 |2               |5            |             |

Here is the code:

import pandas as pd

df = pd.DataFrame(
         [
             [319676, 1, 0],
             [319676, 5, 1],
             [319676, 1, 2],
             [319676, 5, 3],
             [200476, 5, 0],
             [200476, 2, 1],
             [200476, 1, 2],
             [200476, 1, 3],
             [520111, 5, 0],
             [520111, 1, 1],
             [520111, 2, 2],
             [520111, 5, 3],
             [520111, 5, 4],
             [520111, 2, 5],
         ],
         columns = [
             "HT_jets",
             "jet_tagWeightBin",
             "__array_index"
         ]
    )
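To make the flattened layout concrete, here is a minimal sketch of reading values back out of it. It rebuilds the same toy data in a compact form (the `events` and `rows` helpers are illustrative names, not part of root_pandas) and assumes that HT_jets happens to be unique per event, which holds for this sample but is not guaranteed in real data.

```python
import pandas as pd

# Same toy data as above, written compactly: (HT_jets, jet_tagWeightBin values).
events = [
    (319676, [1, 5, 1, 5]),
    (200476, [5, 2, 1, 1]),
    (520111, [5, 1, 2, 5, 5, 2]),
]
rows = [
    (ht, weight, i)
    for ht, weights in events
    for i, weight in enumerate(weights)
]
df = pd.DataFrame(rows, columns=["HT_jets", "jet_tagWeightBin", "__array_index"])

# Value of the jet at position 2 in the event with HT_jets == 200476.
mask = (df["HT_jets"] == 200476) & (df["__array_index"] == 2)
third_jet = df.loc[mask, "jet_tagWeightBin"].iloc[0]

# Collect the per-event arrays back into lists, keeping the original event order.
# Assumption: HT_jets identifies an event uniquely in this sample.
per_event = df.groupby("HT_jets", sort=False)["jet_tagWeightBin"].apply(list)
```

`per_event` is a Series indexed by HT_jets whose entries are the original variable-length lists, which is the information the new columns are meant to hold.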

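Now, what I want to do is get rid of __array_index and add a set of new single-valued variables such as jet_tagWeightBin_0, jet_tagWeightBin_1, jet_tagWeightBin_2, ..., up to however many are needed. So, I would like to end up with something like this: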

|  |HT_jets|jet_tagWeightBin_0|jet_tagWeightBin_1|jet_tagWeightBin_2|jet_tagWeightBin_3|jet_tagWeightBin_4|jet_tagWeightBin_5|
|--|-------|------------------|------------------|------------------|------------------|------------------|------------------|
|0 |319676 |1                 |5                 |1                 |5                 |NaN               |NaN               |
|1 |200476 |5                 |2                 |1                 |1                 |NaN               |NaN               |
|2 |520111 |5                 |1                 |2                 |5                 |5                 |2                 |

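I'm not sure what this type of operation is called, but I'm sure it must be something straightforward. I just don't know how to do it.

Anyway, here is the start of an attempt:

I can add a new column with the corresponding name: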

df["new_name"] = df.apply(lambda row: "jet_tagWeightBin_" + str(row["__array_index"]), axis = 1)

The result looks like this:

|  |HT_jets|jet_tagWeightBin|__array_index|new_name          |
|--|-------|----------------|-------------|------------------|
|0 |319676 |1               |0            |jet_tagWeightBin_0|
|1 |319676 |5               |1            |jet_tagWeightBin_1|
|2 |319676 |1               |2            |jet_tagWeightBin_2|
|3 |319676 |5               |3            |jet_tagWeightBin_3|
|4 |200476 |5               |0            |jet_tagWeightBin_0|
|5 |200476 |2               |1            |jet_tagWeightBin_1|
|6 |200476 |1               |2            |jet_tagWeightBin_2|
|7 |200476 |1               |3            |jet_tagWeightBin_3|
|8 |520111 |5               |0            |jet_tagWeightBin_0|
|9 |520111 |1               |1            |jet_tagWeightBin_1|
|10|520111 |2               |2            |jet_tagWeightBin_2|
|11|520111 |5               |3            |jet_tagWeightBin_3|
|12|520111 |5               |4            |jet_tagWeightBin_4|
|13|520111 |2               |5            |jet_tagWeightBin_5|
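As an aside, the same name column can probably be built without a row-wise `apply`, using vectorized string concatenation. This is only a sketch on a small stand-in frame (the five-row `df` here is illustrative, not the full data):

```python
import pandas as pd

# Small stand-in for the flattened frame; only the index column matters here.
df = pd.DataFrame({"__array_index": [0, 1, 2, 0, 1]})

# Convert the integer index to strings and prepend the variable name.
df["new_name"] = "jet_tagWeightBin_" + df["__array_index"].astype(str)
```

This avoids calling a Python lambda once per row, which tends to matter on large event samples.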

That's where I am. I'd welcome guidance. :)

EDIT: For clarity, I am dealing with a lot of variables. Here are a few more columns from the data:

|  |eventNumber|Mjj_MindR   |HT_jets|jet_tagWeightBin|__array_index|             |
|--|-----------|------------|-------|----------------|-------------|-------------|
|0 |446427     |98896.421875|319676 |1               |0            |<-- 1st event|
|1 |446427     |98896.421875|319676 |5               |1            |             |
|2 |446427     |98896.421875|319676 |1               |2            |             |
|3 |446427     |98896.421875|319676 |5               |3            |             |
|4 |446650     |29691.271484|200476 |5               |0            |<-- 2nd event|
|5 |446650     |29691.271484|200476 |2               |1            |             |
|6 |446650     |29691.271484|200476 |1               |2            |             |
|7 |446650     |29691.271484|200476 |1               |3            |             |
|8 |446707     |57697.246094|520111 |5               |0            |<-- 3rd event|
|9 |446707     |57697.246094|520111 |1               |1            |             |
|10|446707     |57697.246094|520111 |2               |2            |             |
|11|446707     |57697.246094|520111 |5               |3            |             |
|12|446707     |57697.246094|520111 |5               |4            |             |
|13|446707     |57697.246094|520111 |2               |5            |             |

1 Answer:

Answer 0 (score: 2)

This is a pivot problem:

newDF = df.pivot(columns='__array_index', values='jet_tagWeightBin', index='HT_jets')

Then just rename the columns.

Before the renaming, this gives:

__array_index    0    1    2    3    4    5
HT_jets
200476         5.0  2.0  1.0  1.0  NaN  NaN
319676         1.0  5.0  1.0  5.0  NaN  NaN
520111         5.0  1.0  2.0  5.0  5.0  2.0
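The renaming step, and the case from the question's edit where there are several per-event columns, could be sketched as follows. This is one possible way to do it rather than a definitive recipe: the `events`, `rows`, `wide`, `wide_all` and `id_cols` names are illustrative, and the second part assumes that eventNumber, Mjj_MindR and HT_jets are constant within an event and together identify it.

```python
import pandas as pd

# Toy data from the question: per-event values plus the jet array.
events = [
    (446427, 98896.421875, 319676, [1, 5, 1, 5]),
    (446650, 29691.271484, 200476, [5, 2, 1, 1]),
    (446707, 57697.246094, 520111, [5, 1, 2, 5, 5, 2]),
]
rows = [
    (number, mjj, ht, weight, i)
    for number, mjj, ht, weights in events
    for i, weight in enumerate(weights)
]
df = pd.DataFrame(
    rows,
    columns=["eventNumber", "Mjj_MindR", "HT_jets", "jet_tagWeightBin", "__array_index"],
)

# 1) The pivot on HT_jets, followed by the renaming step.
wide = df.pivot(index="HT_jets", columns="__array_index", values="jet_tagWeightBin")
wide = wide.add_prefix("jet_tagWeightBin_")  # 0 -> jet_tagWeightBin_0, ...
wide.columns.name = None                     # drop the leftover "__array_index" label
wide = wide.reset_index()                    # HT_jets becomes an ordinary column again

# 2) Several per-event columns: index by all of them, then unstack the array index.
# Assumption: these columns are constant within an event.
id_cols = ["eventNumber", "Mjj_MindR", "HT_jets"]
wide_all = (
    df.set_index(id_cols + ["__array_index"])["jet_tagWeightBin"]
    .unstack("__array_index")
    .add_prefix("jet_tagWeightBin_")
)
wide_all.columns.name = None
wide_all = wide_all.reset_index()
```

Both reshapes raise an error if the same (event key, `__array_index`) pair appears twice, so keying on HT_jets alone only works while no two events share that value; including eventNumber in the key is the safer choice. Missing jets come out as NaN, which is why the value columns become floats.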