Question

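Horrible title, but here goes. I have a 13,000 x 91 DataFrame. 26 of the columns are numeric. The rows are individual projects, with project performance broken down by year. Like this: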

| Year | Control | Description   | USD_Cost | USD_Profit |
|------|---------|---------------|----------|------------|
| 1991 | A1      | A description | 1        | 2          |
| 1992 | A1      | A Description | 100      | 300        |
| 1991 | B1      | B Description | 3        | 50         |
| 1995 | C1      | C Description | 5        | 10         |
| 1990 | D1      | D Description | 2        | 1          |
| 1996 | D1      | D Description | 1        | 1          |

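Instead of recording how each project performed in a particular year, I only want to record how long each project lasted and how it performed overall: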

| Years | Control | Description   | USD_Cost | USD_Profit |
|-------|---------|---------------|----------|------------|
| 2     | A1      | A description | 101      | 302        |
| 1     | B1      | B Description | 3        | 50         |
| 1     | C1      | C Description | 5        | 10         |
| 2     | D1      | D Description | 3        | 2          |

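Control and Description stay the same, but the numeric columns starting with USD are summed across rows. There are 26 USD columns covering different aspects of performance. There are about 8,000 unique control numbers, but 13,000 Year-ControlNumber combinations in total.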

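I know how to group by a single element (e.g. print(dft.groupby(['Control'])['USD_Cost', 'USD_Profit'].sum())), but when I do that I think I lose all the non-numeric columns. I would also like to avoid typing out all 26 USD column names.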

Can this be done with groupby?

Answer 1

I think this should work for you:

columns = list(filter(lambda x: 'USD' in x, df.columns))
df.groupby(['Control', 'Description'])[columns].sum()

This gives you the sums of all the USD columns, grouped by Control and Description. It works fine for your case, so I think it is the best approach.
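As a minimal sketch of the approach above, here it is run on a small frame rebuilt from the question's table. The sample data and the names usd_columns and totals are assumptions for illustration, and the descriptions are spelled consistently within each project.

```python
import pandas as pd

# Sample frame rebuilt from the question's table (assumed data, for illustration only).
df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description", "A description", "B description",
        "C description", "D description", "D description",
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# Collect every column whose name contains 'USD' so none has to be typed by hand.
usd_columns = [c for c in df.columns if "USD" in c]

# Grouping by both keys keeps Description in the result;
# as_index=False returns the keys as ordinary columns.
totals = df.groupby(["Control", "Description"], as_index=False)[usd_columns].sum()
print(totals)
```

Because Description is one of the grouping keys, rows of the same project whose descriptions differ in any way (even just capitalization) end up in separate groups. This does not produce the year count; that would need a separate aggregation.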

Answer 2

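My solution is to group by "Control" and then apply a function to each group that takes all non-numeric data from the first row (I assume the non-numeric data is the same in every row of a group) but takes the sum of all numeric data. Since you don't want the years summed, Year is handled separately.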

My code:

import pandas as pd
import numpy as np


def sum_project(project):
    # Since only numeric data and years are different,
    # we just take the first row
    project_summed = project.iloc[0, :].copy()

    # sum all numeric data but exclude "Year"
    cols_numeric = project.select_dtypes([np.number]).columns
    cols_numeric = cols_numeric.drop(["Year"])
    project_summed[cols_numeric] = project[cols_numeric].sum()

    # Get year number
    project_summed["Year"] = len(project)

    return project_summed


df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description"
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

final_df = df.groupby(["Control"]).apply(sum_project)

This gives final_df:

         Year Control    Description  USD_Cost  USD_Profit
Control                                                   
A1          2      A1  A description       101         302
B1          1      B1  B description         3          50
C1          1      C1  C description         5          10
D1          2      D1  D description         3           2

Answer 3

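This is a very common operation, and pandas has an elegant way to do it. To avoid the tedious task of writing out 26 sum functions, I use a dictionary comprehension.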

First, define a dictionary of actions per column, then use the agg function:

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description"
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

actions = {'Year': pd.Series.nunique,
           'Description': lambda x: x.iloc[0]}
actions.update({x: 'sum' for x in df.columns if x.startswith('USD_')})

df.groupby('Control').agg(actions).reset_index()

This gives:

  Control  Year    Description  USD_Cost  USD_Profit
0      A1     2  A description       101         302
1      B1     1  B description         3          50
2      C1     1  C description         5          10
3      D1     2  D description         3           2
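If you also want the count column to be called Years, as in the question's desired output, one possible variation is pandas named aggregation. This is only a sketch under assumptions: the sample frame is rebuilt from the question, named and summary are illustrative names, and 'first' is swapped in for the lambda (it takes the first non-null value per group, which matches only when descriptions are never missing).

```python
import pandas as pd

# Sample frame rebuilt from the question's table (assumed data, for illustration only).
df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description", "A description", "B description",
        "C description", "D description", "D description",
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# Map each output column name to an (input column, aggregation) pair.
named = {
    "Years": ("Year", "nunique"),             # number of distinct years per project
    "Description": ("Description", "first"),  # first non-null description per project
}
# Add one sum per USD_ column without typing the names out.
named.update({c: (c, "sum") for c in df.columns if c.startswith("USD_")})

summary = df.groupby("Control", as_index=False).agg(**named)
print(summary)
```

The output columns follow the order of the keys in named, after the grouping column Control.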

DataFrame groupby: combine multiple rows, summing float columns