Group and aggregate a pandas DataFrame to get a summary DataFrame

Time: 2019-10-16 18:24:01

Tags: python pandas dataframe pandas-groupby

I have the following detailed DataFrame:

Source:

import pandas as pd

df_detailed = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition","X001, X002, X003"],
    ["Fail","P1","Late Backup","Late Backup"],
    ["Fail","P1","2 Failed Partition","X001, X002"],
    ["Fail","P2","2 Failed Partition","X001, X002"],
    ["Fail","P2","Late Backup","Late Backup"],
    ["Warn","P2","Huge Size","1GB"],
    ["Warn","P2","Huge Size","2GB"]
], columns = ["Severity", "Partition", "Status", "Comment"])

Output:

  Severity Partition              Status           Comment
0     Fail        P1  3 Failed Partition  X001, X002, X003
1     Fail        P1         Late Backup       Late Backup
2     Fail        P1  2 Failed Partition        X001, X002
3     Fail        P2  2 Failed Partition        X001, X002
4     Fail        P2         Late Backup       Late Backup
5     Warn        P2           Huge Size               1GB
6     Warn        P2           Huge Size               2GB

I want to group and aggregate this to get the following result:

Result:

  Partition                                     Status
0        P1          3 Failed Partition, 2 Late Backup
1        P2  2 Failed Partition, 1 Late Backup, 2 Warn

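Notes:

  1. The keywords "Late Backup", "Failed Partition", and "Huge Size" are static and will not change.

  2. All rows with severity "Fail" should keep their status details in the summary DataFrame.

  3. All other severities, such as "Warn", "Info", etc., should contribute only a severity count, as shown in the expected result example.

  4. In the detailed DataFrame, "Failed Partition" is prefixed with the failure count, but the summary DataFrame should show the count of unique values per partition (i.e. P1, P2).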
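
To make the last note concrete, here is a minimal sketch of how the unique count per partition could be derived from the Comment column. It assumes the comments are comma-separated IDs, as in the example data; the names `failed`, `ids`, and `unique_failed` are only illustrative.

```python
import pandas as pd

# Only the "Failed Partition" rows from the example data above.
failed = pd.DataFrame(
    [
        ["P1", "X001, X002, X003"],
        ["P1", "X001, X002"],
        ["P2", "X001, X002"],
    ],
    columns=["Partition", "Comment"],
)

# Split each comment into individual IDs, put one ID per row,
# and strip any stray whitespace around the IDs.
ids = failed.assign(Comment=failed["Comment"].str.split(",")).explode("Comment")
ids["Comment"] = ids["Comment"].str.strip()

# Count the distinct IDs seen in each partition.
unique_failed = ids.groupby("Partition")["Comment"].nunique()
print(unique_failed.to_dict())
```

For P1 the IDs X001, X002, X003 appear across two rows but are counted once each, so the summary should show 3 rather than 3 + 2.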

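Can anyone help? I haven't slept in two days :(

1 Answer:

Answer 0: (score: 1)

Thanks for the interesting task. I have solved the problem; please find the solution below, follow the comments, and feel free to ask questions.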

import pandas as pd
from collections import Counter

df_detailed = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition", "X001, X002, X003"],
    ["Fail", "P1", "Late Backup", "Late Backup"],
    ["Fail", "P1", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "Late Backup", "Late Backup"],
    ["Warn", "P2", "Huge Size", "1GB"],
    ["Warn", "P2", "Huge Size", "2GB"]
], columns=["Severity", "Partition", "Status", "Comment"])


def change_warn(severity, status):
    """Replace the real Status with just "Warn" for rows whose Severity is "Warn"."""
    if severity == "Warn":
        return "Warn"
    else:
        return status


df_detailed["Status"] = df_detailed.apply(lambda row: change_warn(row["Severity"], row["Status"]), axis=1)


def remove_leading_digits(x):
    if x[0].isdigit():
        x = " ".join(x.split(" ")[1:])
    return x


df_detailed["Status"] = df_detailed["Status"].apply(lambda x: remove_leading_digits(x))

df_detailed["Comment"] = df_detailed["Comment"].apply(lambda x: x + ",")  # trailing comma so the summed (concatenated) comments can be split later

# need to combine to distinguish P1 from P2:
df_detailed["TempStatus"] = df_detailed["Partition"] + " " + df_detailed["Status"]

gr_b = df_detailed[["Partition", "TempStatus", "Comment"]].groupby("TempStatus").sum()


def calculate_unique_comment(status, comment):
    comments = []
    if status.endswith("Failed Partition"):
        for c in comment.split(","):
            if c != "":
                comments.append(c.strip())
        counter = Counter(comments)
        return str(len(counter.keys()))
    else:
        return str(0)


del gr_b["Partition"]  # the concatenated Partition strings are not needed

gr_b = gr_b.reset_index()  # turn TempStatus back into a regular column so it can be read per row

gr_b["CountUnCom"] = gr_b.apply(lambda row: calculate_unique_comment(row["TempStatus"], row["Comment"]), axis=1)

# collect the number of unique comments per Partition for "Failed Partition" rows in a dict
part_dict = {}
for i in range(len(gr_b)):
    if gr_b["TempStatus"][i].endswith("Failed Partition"):
        part_dict[gr_b["TempStatus"][i]] = gr_b["CountUnCom"][i]


# let's take only what we need to work with
df_small = pd.DataFrame(df_detailed[["Partition", "Status"]])

df_small["Status"] = df_small["Status"].apply(lambda x: x + ",")  # to sum and split later

gr_df_small = df_small.groupby("Partition").sum()

gr_df_small = gr_df_small.reset_index()


def convert_status_to_list(status):
    new_status = []
    for c in status.split(","):
        if c != "":
            new_status.append(c.strip())
    return new_status


gr_df_small["Status"] = gr_df_small["Status"].apply(lambda x: convert_status_to_list(x))


def calculate_status(partition, status, x):
    result = []
    for k, v in Counter(status).items():
        if k == "Failed Partition":
            v = x[partition + " " + "Failed Partition"]
        result.append(f"{v} {k}")
    return " ".join(result)


gr_df_small["Status"] = gr_df_small.apply(lambda row: calculate_status(row["Partition"], row["Status"], part_dict),  axis=1)


print(gr_df_small)

Output:

  Partition                                   Status
0        P1         3 Failed Partition 1 Late Backup
1        P2  2 Failed Partition 1 Late Backup 2 Warn
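
For comparison, the same steps can be sketched more compactly with one helper per partition group. This is only a sketch under the assumptions of the example data (comma-separated IDs in Comment, a leading count on "Failed Partition"); `summarize` and `summary` are illustrative names, not part of the solution above. It counts "Late Backup" by rows, which gives 1 for P1 as in the output above, and joins the parts with ", ".

```python
import pandas as pd

df = pd.DataFrame([
    ["Fail", "P1", "3 Failed Partition", "X001, X002, X003"],
    ["Fail", "P1", "Late Backup", "Late Backup"],
    ["Fail", "P1", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "2 Failed Partition", "X001, X002"],
    ["Fail", "P2", "Late Backup", "Late Backup"],
    ["Warn", "P2", "Huge Size", "1GB"],
    ["Warn", "P2", "Huge Size", "2GB"],
], columns=["Severity", "Partition", "Status", "Comment"])


def summarize(group):
    """Build the summary string for the rows of a single partition."""
    parts = []
    fail = group[group["Severity"] == "Fail"]
    # Drop the leading count, e.g. "3 Failed Partition" -> "Failed Partition".
    label = fail["Status"].str.replace(r"^\d+\s+", "", regex=True)
    for name in label.unique():  # keeps order of first appearance
        rows = fail[label == name]
        if name == "Failed Partition":
            # Count distinct IDs across all comments of this partition.
            ids = rows["Comment"].str.split(",").explode().str.strip()
            count = ids.nunique()
        else:
            count = len(rows)
        parts.append(f"{count} {name}")
    # Every other severity contributes only its row count.
    others = group.loc[group["Severity"] != "Fail", "Severity"]
    for severity, count in others.value_counts(sort=False).items():
        parts.append(f"{count} {severity}")
    return ", ".join(parts)


summary = pd.DataFrame(
    [{"Partition": p, "Status": summarize(g)} for p, g in df.groupby("Partition")]
)
print(summary)
```

Iterating over the groups in a plain loop keeps the sketch independent of how `groupby(...).apply` treats the grouping column, which differs between pandas versions.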