Question

My dataset looks like this:

   Year     Month   Day         Category          Quantity
   1984     1        1          2                   10.5
   1984     1        1          6                   3.7
   1985     1        2          8                   4.8
   1985     2        1          3                   20
   1986     1        1          1                   9
   1986     2        1          18                  12.6
   1987     1        29         20                  2.8

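Note that every day of every month of every year contains a single unique entry. In other words, each day can have only one category (not several).

I am trying to count how many times each category occurs per year.

However, using count in Pandas I realized that zero counts are not included. In other words, if a category does not occur in a given year, it is left out. To fix this I tried using fill_value=0 (as shown in the code below).

I ended up with this (warning: do not run this code as-is, since it apparently uses up all memory):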

import pandas as pd

df = pd.read_csv("import.csv", header=0,
                 encoding='iso-8859-1')


midx = pd.MultiIndex.from_product([
        df['Year'],
        df['Category']
        ], names=['Year', 'Category'])


df['QuantityWithNaN'] = pd.to_numeric(df['Quantity'], errors='coerce')

count_quantity_yearly_above_5 = df[df['QuantityWithNaN'] > 5.0].groupby(['Year', 'Category'])['Quantity'].count()

count_quantity_yearly_above_5.reindex(midx, fill_value=0)

df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)

df.to_csv("export.csv",encoding='iso-8859-1')

After running this code, the datatypes of the dataframe df (imported from the CSV) are:

Year                int64
Month               int64
Day                 int64
Category            int64
Quantity            object
QuantityWithNaN     float64

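The final result should look like this, but I cannot achieve it with the code above. (The final result does not need to be sorted in any particular order; the only thing that matters is that all categories appear for every year):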

Year        Month   Day     Category   Quantity count_quantity_yearly_above_5

   1984     1       1           1           10.5                        2
   1984     1       1           2           3.7                         7
   1984     1       2           3           4.8                         1
   1985     2       1           1           20                          9
   1985     1       1           2           9                           1
   1986     2       1           3           12.6                        4
   1987     1       29          20          2.8                         5
   1988                         10           2                          0

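Likewise, for visualization, the information that ultimately matters is given entirely by the columns below, so zero counts are included and there are no duplicate rows for any combination of year and category (obviously I was lazy here; including all categories (1-20) for every year would take more space):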

Year                    Category                count_quantity_yearly_above_5

   1984                     1                                               2
   1984                     2                                               7
   1984                     3                                               1
   1985                     1                                               9
   1985                     2                                               1
   1986                     3                                               4
   1987                     20                                              5
   1988                     13                                              0

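Right now I end up with a separate Series object (count_quantity_yearly_above_5), which I want to insert into the original dataframe df.

With reindex I hope to reduce the number of rows so that there is only one row per unique combination of year and category, meaning each combination of year and category appears only once (in other words, each category is represented only once per year).

Apparently fill_value=0 should tell pandas to include zero counts in count.

Clearly something is wrong with the code, since running it uses up all memory. I suspect it is due to one of these lines: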

count_quantity_yearly_above_5.reindex(midx, fill_value=0)

df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)

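Edit

The main problem is that I cannot add count_quantity_yearly_above_5 as a column to the original dataframe, which probably has to do with count_quantity_yearly_above_5 being a Series object. Right now I am obviously not inserting the Series object into the original dataframe correctly. Any suggestions on how to adjust this code?

Running only that line (df['count_quantity_yearly_above_5'] = df.apply(count_quantity_yearly_above_5,axis=1)) returns the error: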

TypeError: ("'Series' object is not callable", 'occurred at index 0')
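The error above arises because df.apply expects a function, while count_quantity_yearly_above_5 is a Series. One possible way to attach the counts to the original rows is a left merge on Year and Category. The following is only a sketch under assumptions: the inline DataFrame stands in for import.csv (values copied from the sample table, with Quantity stored as strings to mimic the object dtype), and the names counts and merged are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for import.csv, using the sample rows from the question.
# Quantity is kept as strings to mimic the "object" dtype reported there.
df = pd.DataFrame({
    "Year":     [1984, 1984, 1985, 1985, 1986, 1986, 1987],
    "Month":    [1, 1, 1, 2, 1, 2, 1],
    "Day":      [1, 1, 2, 1, 1, 1, 29],
    "Category": [2, 6, 8, 3, 1, 18, 20],
    "Quantity": ["10.5", "3.7", "4.8", "20", "9", "12.6", "2.8"],
})
df["QuantityWithNaN"] = pd.to_numeric(df["Quantity"], errors="coerce")

# Count the rows above the threshold per (Year, Category) and turn the
# resulting Series into a small DataFrame with a named count column.
counts = (
    df[df["QuantityWithNaN"] > 5.0]
    .groupby(["Year", "Category"])
    .size()
    .reset_index(name="count_quantity_yearly_above_5")
)

# A left merge keeps every original row; pairs with no row above the
# threshold come back as NaN, which is replaced by 0 here.
merged = df.merge(counts, on=["Year", "Category"], how="left")
merged["count_quantity_yearly_above_5"] = (
    merged["count_quantity_yearly_above_5"].fillna(0).astype(int)
)

print(merged)
```

Note that this keeps one row per original entry, so it does not add rows for year and category pairs that never occur in the data.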

Edit 2

I just found out which line causes the 100% memory usage:

count_quantity_yearly_above_5.reindex(midx, fill_value=0)
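A likely explanation for the memory usage is that midx is built from the full Year and Category columns rather than from their distinct values, so it holds one entry per pair of rows (the number of rows squared). The sketch below compares the two index sizes on the sample rows and reindexes against the smaller one. It assumes an inline DataFrame in place of import.csv and a numeric Quantity column; midx_full, midx_unique and counts are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for import.csv, using the sample rows from the question.
df = pd.DataFrame({
    "Year":     [1984, 1984, 1985, 1985, 1986, 1986, 1987],
    "Category": [2, 6, 8, 3, 1, 18, 20],
    "Quantity": [10.5, 3.7, 4.8, 20, 9, 12.6, 2.8],
})

# Product of the raw columns: duplicates are kept, so the index has
# len(df) * len(df) entries.
midx_full = pd.MultiIndex.from_product(
    [df["Year"], df["Category"]], names=["Year", "Category"]
)

# Product of the distinct values: one entry per (year, category) pair.
midx_unique = pd.MultiIndex.from_product(
    [df["Year"].unique(), df["Category"].unique()], names=["Year", "Category"]
)

print(len(midx_full), len(midx_unique))

counts = (
    df[df["Quantity"] > 5.0]
    .groupby(["Year", "Category"])["Quantity"]
    .count()
)

# reindex returns a new Series, so the result has to be assigned;
# pairs missing from the grouped result get the fill value 0.
counts = counts.reindex(midx_unique, fill_value=0)
print(counts)
```

With a real CSV of many thousands of rows, the squared index is what grows out of control, while the product of distinct years and categories stays small.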

Answer 1

You probably want to use groupby.

The following returns the final table you want, containing only the Year, Category and count_quantity_yearly_above_5 columns.

df.groupby(['Year', 'Category']).size().reset_index(name='count_quantity_yearly_above_5')
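As written, that one-liner counts every row per pair and lists only the pairs that actually occur. If the threshold of 5 and the zero counts from the question are also needed, one possible variation is to sum a 0/1 flag per group and fill the missing combinations through unstack and stack. This is a sketch, not the answer's own method: the inline DataFrame replaces import.csv, Quantity is assumed numeric, and above and table are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for import.csv, using the sample rows from the question.
df = pd.DataFrame({
    "Year":     [1984, 1984, 1985, 1985, 1986, 1986, 1987],
    "Category": [2, 6, 8, 3, 1, 18, 20],
    "Quantity": [10.5, 3.7, 4.8, 20, 9, 12.6, 2.8],
})

# 1 where the quantity exceeds the threshold, 0 otherwise.
above = df["Quantity"].gt(5.0).astype(int)

table = (
    above.groupby([df["Year"], df["Category"]]).sum()  # count of flagged rows per pair
    .unstack(fill_value=0)                              # years as rows, categories as columns
    .stack()                                            # back to one row per (Year, Category)
    .reset_index(name="count_quantity_yearly_above_5")
)

print(table)
```

This only covers years and categories that appear somewhere in the data; a year or category that is entirely absent would still need to be supplied explicitly, for example through reindex.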

Here is a similar problem with more detailed answers:

How to include zero values in a Pandas count and merge the result with the original dataframe
