Reshaping a pandas DataFrame to combine similar columns

Time: 2021-03-29 23:31:59

Tags: python pandas

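I have a very large pandas DataFrame that I use as a relational table. It looks like this (a few thousand rows, with a varying number of activities per trail_id):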

table_1

I would like to reshape it into this:

table 2

I have tried both of the following, but neither seems to work:

pd.melt(df)
df.stack().reset_index()

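Any help would be greatly appreciated!

3 answers:

Answer 0: (score: 3)

You need to pass id_vars to .melt() to get the output you want. Without id_vars, trail_id is melted along with the activity columns instead of being kept as an identifier on each row.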

>>> df.melt(id_vars='trail_id', value_name='activity_id').drop(columns='variable')
    trail_id  activity_id
0          1            1
1          2            1
2          3            1
3          4            3
4          5            2
5          1            2
6          2            2
7          3            2
8          4            4
9          5            5
10         1            3
11         2            4
12         3            6
13         4            7
14         5            9
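
The result above is ordered column by column (all of activity_1, then activity_2, and so on). If you want the rows grouped by trail_id instead, a stable sort after the melt should do it. Here is a minimal sketch. It assumes the sample data used in answer 1 (the asker's real frame is not shown), and it assumes that trails with fewer activities are padded with NaN, which the dropna step would then remove:

```python
import pandas as pd

# Sample data borrowed from answer 1; the asker's real frame is not shown.
df = pd.DataFrame({
    "trail_id": [1, 2, 3, 4, 5],
    "activity_1": [1, 1, 1, 3, 2],
    "activity_2": [2, 2, 2, 4, 5],
    "activity_3": [3, 4, 6, 7, 9],
})

long_df = (
    df.melt(id_vars="trail_id", value_name="activity_id")
      .drop(columns="variable")
      # Assumption: missing activities are stored as NaN. This is a no-op on the sample data.
      .dropna(subset=["activity_id"])
      # A stable sort keeps the activity_1, activity_2, activity_3 order within each trail.
      .sort_values("trail_id", kind="stable")
      .reset_index(drop=True)
)
print(long_df)
```

If your frame really does contain NaN, activity_id will come out as float; you can cast it back with .astype(int) after the dropna.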

Answer 1: (score: 2)

You can run the following code, which builds the long table by hand from a dictionary:

import pandas as pd
# Initializing
dataframe1 = pd.DataFrame({'trail_id':[1,2,3,4,5],
                           'activity_1':[1,1,1,3,2],
                           'activity_2':[2,2,2,4,5],
                           'activity_3':[3,4,6,7,9]})
dictionary = dataframe1.to_dict()

# Create the final dictionary to put the values in
main_dict = {"trail_id":[], "activity_id":[]}
for key, value in dictionary.items():
    # Skip the id column itself; every other column contributes one block of rows
    if key == "trail_id":
        continue
    main_dict["trail_id"] += list(dictionary["trail_id"].values())
    main_dict["activity_id"] += list(value.values())

# Dropping the index is not necessary but it helps to have a cleaner output
last_dataframe = pd.DataFrame(data=main_dict).sort_values(by = ["trail_id"]).reset_index(drop=True)
print(last_dataframe)

Output:

    trail_id  activity_id
0          1            1
1          1            2
2          1            3
3          2            1
4          2            2
5          2            4
6          3            1
7          3            2
8          3            6
9          4            3
10         4            4
11         4            7
12         5            2
13         5            5
14         5            9

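Answer 2: (score: 1)

You can also collect the activity columns into one list per row and then explode that list: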

df = (
    pd.concat(
        [
            df["trail_id"],
            df.loc[:, "activity_1":"activity_3"].apply(list, axis=1),
        ],
        axis=1,
    )
    .explode(0)
    .rename(columns={0: "activity_id"})
)
print(df)

Prints:

   trail_id activity_id
0         1           1
0         1           2
0         1           3
1         2           1
1         2           2
1         2           4
2         3           1
2         3           2
2         3           6
3         4           3
3         4           4
3         4           7
4         5           2
4         5           5
4         5           9
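
Note that the exploded frame keeps the original row labels (0, 0, 0, 1, 1, 1, ...) and that activity_id ends up with object dtype. If that matters, something along these lines should tidy it up. This is only a sketch: it assumes the same sample data as answer 1 and no missing values (the integer cast would fail on NaN), and it names the list column up front rather than relying on the default label 0:

```python
import pandas as pd

# Sample data borrowed from answer 1; the asker's real frame is not shown.
df = pd.DataFrame({
    "trail_id": [1, 2, 3, 4, 5],
    "activity_1": [1, 1, 1, 3, 2],
    "activity_2": [2, 2, 2, 4, 5],
    "activity_3": [3, 4, 6, 7, 9],
})

# One list of activities per row, named so the column label is explicit.
activities = (
    df.loc[:, "activity_1":"activity_3"]
      .apply(list, axis=1)
      .rename("activity_id")
)

out = (
    pd.concat([df["trail_id"], activities], axis=1)
      .explode("activity_id")
      .reset_index(drop=True)               # replace the repeated row labels with 0..n-1
      .astype({"activity_id": "int64"})     # explode leaves object dtype behind
)
print(out)
```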