Add additional variables with keys and values, and append a dictionary to a DataFrame

Time: 2021-04-22 09:20:22

Tags: python python-3.x pandas dataframe

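I need to pass columns from a DataFrame to a function that returns a dictionary, and that dictionary needs to be appended to two columns of the same DataFrame, result and cost.

For example, the function is: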

    def costsplit(acc, srv, owner, cost):
        test = splitter().split(acc, srv, owner, cost)
        return test

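Suppose test is returned as a dictionary such as test = {'dps':32, 'dd':21, 'ct':92, 'cc':32}

This means that {'dps':32, 'dd':21, 'ct':92, 'cc':32} is returned by test when acc = dev, srv = instance, owner = dpc, and cost = 30 are passed, i.e. row 1 of the DataFrame below. Likewise, another output, {'dps':20, 'dd':21, 'ct':92, 'cc':2}, is returned by test when acc = prd, srv = instance, owner = abs, and cost = 35 are passed, i.e. row 4. These should be appended to the result and cost columns of the DataFrame.

The current DataFrame looks like: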

    date         acc   srv        owner   result        cost
    2021-03-01   dev   bucket     dps     gcp.dev.dps   177
    2021-03-01   prd   instance   abs     gcp.prd.abs   35
    2021-03-01   dev   spanner    cc      gcp.dev.cc    98
    2021-03-01   prd   instance   it      gcp.prd.it    135

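Now the output DataFrame should have the dictionary's key-value pairs appended to the result and cost columns.

The output should look like this: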

    date         acc   srv        owner   result        cost
    2021-03-01   dev   bucket     dps     gcp.dev.dps   177
    2021-03-01   prd   instance   abs     gcp.prd.abs   35
    2021-03-01   dev   spanner    cc      gcp.dev.cc    98
    2021-03-01   prd   instance   it      gcp.prd.it    135
    2021-03-01                            gcp.dev.dps   32
    2021-03-01                            gcp.dev.dd    21
    2021-03-01                            gcp.dev.ct    92
    2021-03-01                            gcp.dev.cc    32
    2021-03-01                            gcp.prd.dps   20
    2021-03-01                            gcp.prd.dd    21
    2021-03-01                            gcp.prd.ct    92
    2021-03-01                            gcp.prd.cc    2

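That is, a loop runs over each row of the current DataFrame and passes the acc, srv, owner, and cost column values to the costsplit function. Each key of the returned test dictionary should be appended to the result column in the form gcp.{acc}.{testkey}, and each value of test should be appended to the cost column.

The splitter().split function divides the cost and renames the owner based on each row sent from the DataFrame.

With the command below, I can only fill in the result column, not the cost column.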

    acc['result'] = acc.apply(lambda x: [f'gcp.{acc}.{squ}' for squ, cost in test.items()], axis=1)

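2 Answers:

Answer 0: (score: 0)

I'm not sure I understood you correctly.

But I'd like to suggest a work-in-progress solution. Tell me which parts were miscommunicated, and we'll sort it out together. :-)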

    import pandas as pd

    COST_DICT = {
        "bucket": {"dev": {"dps": 177}},
        "spanner": {"dev": {"cc": 98}},
        "instance": {
            "prd": {"dps": 20, "dd": 21, "ct": 92, "cc": 2, "it": 135, "abs": 35},
            "dev": {"dps": 32, "dd": 21, "ct": 92, "cc": 32},
        },
    }


    def costsplit(acc, srv, owner, previous_cost):
        add_cost = COST_DICT[srv][acc][owner]
        result = f"gcp.{acc}.{owner}"
        cost = previous_cost + add_cost
        return pd.Series({"result": result, "cost": cost})


    acc_content = {
        "date": ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-01"],
        "acc": ["dev", "prd", "dev", "prd"],
        "srv": ["bucket", "instance", "spanner", "instance"],
        "owner": ["dps", "abs", "cc", "it"],
        "prev_cost": [0, 0, 0, 0],
    }

    acc_first = pd.DataFrame(acc_content)
    acc_first[["result", "cost"]] = acc_first.apply(
        lambda row: costsplit(row["acc"], row["srv"], row["owner"], row["prev_cost"]), axis=1
    )

    #          date  acc       srv owner  prev_cost       result  cost
    # 0  2021-03-01  dev    bucket   dps          0  gcp.dev.dps   177
    # 1  2021-03-01  prd  instance   abs          0  gcp.prd.abs    35
    # 2  2021-03-01  dev   spanner    cc          0   gcp.dev.cc    98
    # 3  2021-03-01  prd  instance    it          0   gcp.prd.it   135

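I don't understand why your output DataFrame is empty in the acc, srv, and owner columns. Didn't you say you want to iterate over the rows, use those columns to create result, and overwrite cost? Based on your explanation, this is what I think makes the most sense: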

    acc_content = {
        "date": [
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
            "2021-03-01",
        ],
        "acc": ["dev", "dev", "dev", "dev", "prd", "prd", "prd", "prd"],
        "srv": ["instance", "instance", "instance", "instance", "instance", "instance", "instance", "instance"],
        "owner": ["dps", "dd", "ct", "cc", "dps", "dd", "ct", "cc"],
        "prev_cost": [0, 0, 0, 0, 0, 0, 0, 0],
    }


    acc_second = pd.DataFrame(acc_content)

    acc_second[["result", "cost"]] = acc_second.apply(
        lambda row: costsplit(row["acc"], row["srv"], row["owner"], row["prev_cost"]), axis=1
    )

    #          date  acc       srv owner  prev_cost       result  cost
    # 0  2021-03-01  dev  instance   dps          0  gcp.dev.dps    32
    # 1  2021-03-01  dev  instance    dd          0   gcp.dev.dd    21
    # 2  2021-03-01  dev  instance    ct          0   gcp.dev.ct    92
    # 3  2021-03-01  dev  instance    cc          0   gcp.dev.cc    32
    # 4  2021-03-01  prd  instance   dps          0  gcp.prd.dps    20
    # 5  2021-03-01  prd  instance    dd          0   gcp.prd.dd    21
    # 6  2021-03-01  prd  instance    ct          0   gcp.prd.ct    92
    # 7  2021-03-01  prd  instance    cc          0   gcp.prd.cc     2

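Discussion points:

  1. The test dicts you provided only cover acc=dev and acc=prd. Are there other options?
  2. The test dicts are only for srv=instance, but your first df also contains bucket and spanner. I have added that information to the cost dictionary, please check it.
  3. I needed to add values for abs and it to COST_DICT["instance"]["prd"] to make it work for the first DataFrame.
  4. Please explain again what the output should be.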
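If the intent is instead what the question's expected output shows, i.e. one appended row per dictionary entry, the `apply` attempt from the question can be extended to return (result, cost) pairs and then flattened with `Series.explode`. The following is only a minimal sketch under assumptions: the `costsplit` stub and its `SPLITS` table are hypothetical stand-ins for `splitter().split`, and the two-row sample frame is modeled on the two calls described in the question rather than taken from it.

```python
import pandas as pd

# Hypothetical stand-in for splitter().split: the real splitting logic is not
# shown in the question, so the two example dicts are simply keyed by acc.
SPLITS = {
    "dev": {"dps": 32, "dd": 21, "ct": 92, "cc": 32},
    "prd": {"dps": 20, "dd": 21, "ct": 92, "cc": 2},
}


def costsplit(acc, srv, owner, cost):
    return SPLITS[acc]


# Illustrative sample data (assumption), one row per call described in the question.
df = pd.DataFrame({
    "date": ["2021-03-01", "2021-03-01"],
    "acc": ["dev", "prd"],
    "srv": ["instance", "instance"],
    "owner": ["dps", "abs"],
    "result": ["gcp.dev.dps", "gcp.prd.abs"],
    "cost": [30, 35],
})

# For every row, build a list of (result, cost) pairs from the returned dict.
pairs = df.apply(
    lambda row: [
        (f"gcp.{row['acc']}.{key}", value)
        for key, value in costsplit(row["acc"], row["srv"], row["owner"], row["cost"]).items()
    ],
    axis=1,
)

# One pair per line; rows whose dict was empty would become NaN and are dropped.
exploded = pairs.explode().dropna()
new_rows = pd.DataFrame(exploded.tolist(), columns=["result", "cost"], index=exploded.index)

# Carry over the date of the row each pair came from, then append.
new_rows["date"] = df.loc[new_rows.index, "date"].to_numpy()
out = pd.concat([df, new_rows], ignore_index=True)
print(out)
```

The appended rows have NaN in acc, srv, and owner; calling `out.fillna("")` afterwards would blank them out as in the question's expected output.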

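Answer 1: (score: -1)

If I understood your comment correctly, you need to pass acc into the costsplit function in order to generate the keys. To do that, you can define a new costsplit function that wraps the existing one: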

    def new_costsplit(acc, srv, owner, cost):
        test = costsplit(acc, srv, owner, cost)
        return {f'gcp.{acc}.{k}': v for k, v in test.items()}

And use this new function to get your test_returns:

    test_returns = new_costsplit(acc, srv, owner, cost)
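As a quick check of what the wrapper produces, here is a self-contained sketch. The `costsplit` stub is hypothetical; it simply returns the example dictionary from the question so the snippet can run without the real `splitter().split`.

```python
def costsplit(acc, srv, owner, cost):
    # Hypothetical stub standing in for the real costsplit / splitter().split.
    return {"dps": 32, "dd": 21, "ct": 92, "cc": 32}


def new_costsplit(acc, srv, owner, cost):
    # Prefix every key of the returned dict with gcp.{acc}.
    test = costsplit(acc, srv, owner, cost)
    return {f"gcp.{acc}.{k}": v for k, v in test.items()}


test_returns = new_costsplit("dev", "instance", "dps", 30)
print(test_returns)
# {'gcp.dev.dps': 32, 'gcp.dev.dd': 21, 'gcp.dev.ct': 92, 'gcp.dev.cc': 32}
```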

Then you can convert the output of test_returns into a DataFrame:

    import pandas as pd
    test_returns = {'dps':32, 'dd':21, 'ct':92, 'cc':32}
    test_returns = {f'gcp.dev.{k}': v for k, v in test_returns.items()}
    test_returns_df = pd.DataFrame({'result': list(test_returns.keys()), 'cost': list(test_returns.values())})
    # df is your original DataFrame, indexed by date; this only works because it also has 4 rows
    test_returns_df.index = df.index
    test_returns_df
    #                     result  cost
    #    date                         
    #    2021-03-01  gcp.dev.dps    32
    #    2021-03-01   gcp.dev.dd    21
    #    2021-03-01   gcp.dev.ct    92
    #    2021-03-01   gcp.dev.cc    32

Then append it to your original DataFrame:

    df_new = pd.concat([df, test_returns_df], axis=0)
    df_new = df_new.fillna("")
    df_new
    #                acc cost owner          result       srv
    #date                                                    
    #2021-03-01      dev   30   dps     gcp.dev.dps    bucket
    #2021-03-01      prd   35   abs     gcp.prd.abs  instance
    #2021-03-01      dev   98    cc      gcp.dev.cc   spanner
    #2021-03-01  sandbox   94    it  gcp.sandbox.it   bigdata
    #2021-03-01            32           gcp.dev.dps          
    #2021-03-01            21            gcp.dev.dd          
    #2021-03-01            92            gcp.dev.ct          
    #2021-03-01            32            gcp.dev.cc
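The steps above handle a single dictionary. To repeat them for every row, one option is to loop over the rows, collect one record per key/value pair, and concatenate once at the end. This is a rough sketch, not a definitive implementation: `append_splits` and `stub_costsplit` are hypothetical names, the stub only imitates the two dictionaries quoted in the question, and the three-row sample frame is made up for illustration.

```python
import pandas as pd


def append_splits(df, split_func):
    """Return df plus one extra row per key/value pair split_func yields for each row."""
    records = []
    for row in df.itertuples():  # row.Index holds the date index label
        split = split_func(row.acc, row.srv, row.owner, row.cost)
        for key, value in split.items():
            records.append(
                {"date": row.Index, "result": f"gcp.{row.acc}.{key}", "cost": value}
            )
    if not records:
        return df.copy()
    extra = pd.DataFrame(records).set_index("date")
    # acc, srv and owner are missing in the new rows; blank them out.
    return pd.concat([df, extra]).fillna("")


def stub_costsplit(acc, srv, owner, cost):
    # Hypothetical stand-in: only instance rows are split (assumption).
    splits = {
        "dev": {"dps": 32, "dd": 21, "ct": 92, "cc": 32},
        "prd": {"dps": 20, "dd": 21, "ct": 92, "cc": 2},
    }
    return splits.get(acc, {}) if srv == "instance" else {}


# Illustrative sample data (assumption), indexed by date like df above.
df = pd.DataFrame(
    {
        "acc": ["dev", "prd", "dev"],
        "srv": ["instance", "instance", "bucket"],
        "owner": ["dps", "abs", "cc"],
        "result": ["gcp.dev.dps", "gcp.prd.abs", "gcp.dev.cc"],
        "cost": [30, 35, 98],
    },
    index=pd.Index(["2021-03-01"] * 3, name="date"),
)

df_new = append_splits(df, stub_costsplit)
print(df_new)
```

Building the records in a plain list and calling `pd.concat` once avoids growing the DataFrame row by row inside the loop.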