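I have some data in pandas DataFrames that is read from ROOT files using root_pandas. Most of the data are simple variables that take a single value per physics event. However, some variables are arrays of numbers. To load these arrays, there is an option to flatten the variable.
So, for example, the array variable jet_tagWeightBin can have a varying number of values, depending on the number of jets in the physics event. When "flattened", the value for each jet in a given physics event can be accessed using the index __array_index.
Here is what loading three physics events looks like. You can see that each physics event has a single HT_jets value, but multiple jet_tagWeightBin values that can be accessed by their index: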
| |HT_jets|jet_tagWeightBin|__array_index| |
|--|-------|----------------|-------------|--|
|0 |319676 |1 |0 |<-- 1st event|
|1 |319676 |5 |1 | |
|2 |319676 |1 |2 | |
|3 |319676 |5 |3 | |
|4 |200476 |5 |0 |<-- 2nd event|
|5 |200476 |2 |1 | |
|6 |200476 |1 |2 | |
|7 |200476 |1 |3 | |
|8 |520111 |5 |0 |<-- 3rd event|
|9 |520111 |1 |1 | |
|10|520111 |2 |2 | |
|11|520111 |5 |3 | |
|12|520111 |5 |4 | |
|13|520111 |2 |5 | |
Here is the code:
import pandas as pd

df = pd.DataFrame(
    [
        [319676, 1, 0],
        [319676, 5, 1],
        [319676, 1, 2],
        [319676, 5, 3],
        [200476, 5, 0],
        [200476, 2, 1],
        [200476, 1, 2],
        [200476, 1, 3],
        [520111, 5, 0],
        [520111, 1, 1],
        [520111, 2, 2],
        [520111, 5, 3],
        [520111, 5, 4],
        [520111, 2, 5],
    ],
    columns=[
        "HT_jets",
        "jet_tagWeightBin",
        "__array_index",
    ],
)
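Now, what I want to do is get rid of __array_index and add a set of new single-valued variables such as jet_tagWeightBin_0, jet_tagWeightBin_1, jet_tagWeightBin_2, ..., up to as many as are needed. So I would like to end up with something like this: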
| |HT_jets|jet_tagWeightBin_0|jet_tagWeightBin_1|jet_tagWeightBin_2|jet_tagWeightBin_3|jet_tagWeightBin_4|jet_tagWeightBin_5|
|--|-------|------------------|------------------|------------------|------------------|------------------|------------------|
|0 |319676 |1 |5 |1 |5 |NaN |NaN |
|1 |200476 |5 |2 |1 |1 |NaN |NaN |
|2 |520111 |5 |1 |2 |5 |5 |2 |
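I'm not sure what this type of operation is called, but I'm sure it must be something straightforward. I just don't know how to do it.
Anyway, here is the start of an attempt:
I can add a new column with the appropriate names: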
df["new_name"] = df.apply(lambda row: "jet_tagWeightBin_" + str(row["__array_index"]), axis = 1)
The result looks like this:
| |HT_jets|jet_tagWeightBin|__array_index|new_name |
|--|-------|----------------|-------------|------------------|
|0 |319676 |1 |0 |jet_tagWeightBin_0|
|1 |319676 |5 |1 |jet_tagWeightBin_1|
|2 |319676 |1 |2 |jet_tagWeightBin_2|
|3 |319676 |5 |3 |jet_tagWeightBin_3|
|4 |200476 |5 |0 |jet_tagWeightBin_0|
|5 |200476 |2 |1 |jet_tagWeightBin_1|
|6 |200476 |1 |2 |jet_tagWeightBin_2|
|7 |200476 |1 |3 |jet_tagWeightBin_3|
|8 |520111 |5 |0 |jet_tagWeightBin_0|
|9 |520111 |1 |1 |jet_tagWeightBin_1|
|10|520111 |2 |2 |jet_tagWeightBin_2|
|11|520111 |5 |3 |jet_tagWeightBin_3|
|12|520111 |5 |4 |jet_tagWeightBin_4|
|13|520111 |2 |5 |jet_tagWeightBin_5|
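The same new_name column can presumably also be built without a row-wise apply, by concatenating strings on the whole column at once. A minimal sketch on a shortened copy of the data (the three-row frame here is only for illustration):

```python
import pandas as pd

# Shortened illustrative copy of the flattened data.
df = pd.DataFrame({"jet_tagWeightBin": [1, 5, 1], "__array_index": [0, 1, 2]})

# Convert the index column to strings and prepend the prefix column-wise.
df["new_name"] = "jet_tagWeightBin_" + df["__array_index"].astype(str)
print(df["new_name"].tolist())
# ['jet_tagWeightBin_0', 'jet_tagWeightBin_1', 'jet_tagWeightBin_2']
```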
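That's where I am. I would welcome guidance. :)
EDIT: For clarity, I am dealing with a lot of variables. Here are some more of the columns in the data: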
| |eventNumber|Mjj_MindR |HT_jets|jet_tagWeightBin|__array_index| |
|--|-----------|------------|-------|----------------|-------------|--|
|0 |446427 |98896.421875|319676 |1 |0 |<-- 1st event|
|1 |446427 |98896.421875|319676 |5 |1 | |
|2 |446427 |98896.421875|319676 |1 |2 | |
|3 |446427 |98896.421875|319676 |5 |3 | |
|4 |446650 |29691.271484|200476 |5 |0 |<-- 2nd event|
|5 |446650 |29691.271484|200476 |2 |1 | |
|6 |446650 |29691.271484|200476 |1 |2 | |
|7 |446650 |29691.271484|200476 |1 |3 | |
|8 |446707 |57697.246094|520111 |5 |0 |<-- 3rd event|
|9 |446707 |57697.246094|520111 |1 |1 | |
|10|446707 |57697.246094|520111 |2 |2 | |
|11|446707 |57697.246094|520111 |5 |3 | |
|12|446707 |57697.246094|520111 |5 |4 | |
|13|446707 |57697.246094|520111 |2 |5 | |
Answer 0 (score: 2)
This is a pivot problem:
newDF = df.pivot(index='HT_jets', columns='__array_index', values='jet_tagWeightBin')
Then just rename the columns.
The pivot itself gives:
__array_index    0    1    2    3    4    5
HT_jets
200476         5.0  2.0  1.0  1.0  NaN  NaN
319676         1.0  5.0  1.0  5.0  NaN  NaN
520111         5.0  1.0  2.0  5.0  5.0  2.0
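The renaming step, and the case from the question's edit where several columns describe each event, could be handled along the following lines. This is only a sketch: it swaps pivot for the equivalent set_index/unstack so that more than one per-event column can be kept, it assumes eventNumber and HT_jets together identify an event uniquely, and the names event_cols and wide are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "eventNumber": [446427] * 4 + [446650] * 4 + [446707] * 6,
        "HT_jets": [319676] * 4 + [200476] * 4 + [520111] * 6,
        "jet_tagWeightBin": [1, 5, 1, 5, 5, 2, 1, 1, 5, 1, 2, 5, 5, 2],
        "__array_index": [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 4, 5],
    }
)

# Columns holding one value per event (assumption: together they are unique per event).
event_cols = ["eventNumber", "HT_jets"]

wide = (
    df.set_index(event_cols + ["__array_index"])["jet_tagWeightBin"]
    .unstack("__array_index")         # one column per array index, NaN where missing
    .add_prefix("jet_tagWeightBin_")  # 0 -> jet_tagWeightBin_0, 1 -> jet_tagWeightBin_1, ...
)
wide.columns.name = None              # drop the leftover "__array_index" label
wide = wide.reset_index()             # turn the event columns back into ordinary columns
print(wide)
```

Other per-event columns such as Mjj_MindR should be able to go into event_cols in the same way. Events with fewer jets get NaN in the trailing columns, which is why the values come out as floats.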