How do I sum the values in one column depending on the entries in other columns?

Time: 2018-06-20 15:12:43

Tags: python pandas

I have the following data frame:

    Course  Orders Ingredient 1 Ingredient 2  Ingredient 3
    starter 3      Fish         Bread         Mayonnaise
    starter 1      Olives       Bread   
    starter 5      Hummus       Pita    
    main    1      Pizza        
    main    6      Beef         Potato        Peas
    main    9      Fish         Peas    
    main    11     Bread        Mayonnaise    Beef
    main    4      Pasta        Bolognese     Peas
    desert  10     Cheese       Olives        Crackers
    desert  7      Cookies      Cream   
    desert  8      Cheesecake   Cream   

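I want to total the number of orders for each ingredient within each course. Which column an ingredient appears in does not matter.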

The following data frame is the output I am hoping for:

    Course   Ord  Ing1        IngOrd1  Ing2        IngOrd2  Ing3        IngOrd3
    starter  3    Fish        3        Bread       4        Mayonnaise  3
    starter  1    Olives      1        Bread       4
    starter  5    Hummus      5        Pita        5
    main     1    Pizza       1
    main     6    Beef        17       Potato      6        Peas        19
    main     9    Fish        9        Peas        19
    main     11   Bread       11       Mayonnaise  11       Beef        17
    main     4    Pasta       4        Bolognese   4        Peas        19
    desert   10   Cheese      10       Olives      10       Crackers    10
    desert   7    Cookies     7        Cream       15
    desert   8    Cheesecake  8        Cream       15

I tried using groupby().sum(), but that does not work when the ingredients are spread across 3 columns.

I also can't use a lookup, because there are cases across the full data frame where I don't know which ingredient I am looking for.

1 Answer:

Answer 0: (score: 0)

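I don't believe this can be done with groupby or other similar pandas methods on their own, though I'd be happy to be proven wrong. In any case, the following isn't especially pretty, but it will give you what you're after.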

import pandas as pd
from collections import defaultdict

# The data you provided
df = pd.read_csv('orders.csv')

# Group these labels for convenience
ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

# Interleave the two lists for final data frame
combined = [y for x in zip(ingredients, orders) for y in x]

# Restructure the data frame so we can group on ingredients
melted = pd.melt(df, id_vars=['Course', 'Orders'], value_vars=ingredients, value_name='Ingredient')

# This is a map that we can apply to each ingredient column to
# look up the correct order count
maps = defaultdict(lambda: defaultdict(int))

# Build the map. Each course maps to a dict of per-ingredient totals,
# e.g. maps['main']['Beef'] == 17
for index, group in melted.groupby(['Course', 'Ingredient']):
    course, ingredient = index
    maps[course][ingredient] += group.Orders.sum()

# Now apply the map to each ingredient column of the data frame
# to create the new count columns
for i, o in zip(ingredients, orders):
    df[o] = df.apply(lambda x: maps[x.Course][x[i]], axis=1)

# Reorder the columns so each count follows its ingredient
df = df[['Course', 'Orders'] + combined]

print(df)

     Course  Orders Ingredient 1  IngOrd1 Ingredient 2  IngOrd2 Ingredient 3  IngOrd3
0   starter       3         Fish        3        Bread        4   Mayonnaise        3
1   starter       1       Olives        1        Bread        4          NaN        0
2   starter       5       Hummus        5         Pita        5          NaN        0
3      main       1        Pizza        1          NaN        0          NaN        0
4      main       6         Beef       17       Potato        6         Peas       19
5      main       9         Fish        9         Peas       19          NaN        0
6      main      11        Bread       11   Mayonnaise       11         Beef       17
7      main       4        Pasta        4    Bolognese        4         Peas       19
8    desert      10       Cheese       10       Olives       10     Crackers       10
9    desert       7      Cookies        7        Cream       15          NaN        0
10   desert       8   Cheesecake        8        Cream       15          NaN        0
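
For comparison, here is a rough sketch of the same per-course totals computed with a grouped sum and left merges instead of the nested dictionary and row-wise apply. It is not the method used above, just one possible variant. The inline frame is an assumption that stands in for orders.csv (it copies the question's table), and empty ingredient slots are assumed to be NaN.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the question's table, standing in for orders.csv
df = pd.DataFrame({
    'Course': ['starter'] * 3 + ['main'] * 5 + ['desert'] * 3,
    'Orders': [3, 1, 5, 1, 6, 9, 11, 4, 10, 7, 8],
    'Ingredient 1': ['Fish', 'Olives', 'Hummus', 'Pizza', 'Beef', 'Fish',
                     'Bread', 'Pasta', 'Cheese', 'Cookies', 'Cheesecake'],
    'Ingredient 2': ['Bread', 'Bread', 'Pita', np.nan, 'Potato', 'Peas',
                     'Mayonnaise', 'Bolognese', 'Olives', 'Cream', 'Cream'],
    'Ingredient 3': ['Mayonnaise', np.nan, np.nan, np.nan, 'Peas', np.nan,
                     'Beef', 'Peas', 'Crackers', np.nan, np.nan],
})

ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']

# Long format: one row per (order row, ingredient slot), empty slots dropped
melted = df.melt(id_vars=['Course', 'Orders'], value_vars=ingredients,
                 value_name='Ingredient').dropna(subset=['Ingredient'])

# One row per course/ingredient pair with its total order count
totals = melted.groupby(['Course', 'Ingredient'], as_index=False)['Orders'].sum()

# Attach the totals to each ingredient column with a left merge, which
# keeps the original rows and their order
for n, col in enumerate(ingredients, start=1):
    lookup = totals.rename(columns={'Ingredient': col, 'Orders': f'IngOrd{n}'})
    df = df.merge(lookup, on=['Course', col], how='left')

print(df)
```

In this variant an empty ingredient slot ends up with a NaN count rather than 0, so the affected count columns become floats.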

If the NaN ingredients and the 0 counts that go with them are a problem, you will need to handle them, but that is a trivial task.
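
One possible way to do that cleanup is sketched below. The three sample rows are copied from the output above purely for illustration, and the choice to show empty slots as empty strings and missing counts as `<NA>` is an assumption about what the final table should look like.

```python
import numpy as np
import pandas as pd

# Assumption: a few rows shaped like the output above, copied for illustration
df = pd.DataFrame({
    'Course': ['starter', 'main', 'main'],
    'Orders': [1, 1, 6],
    'Ingredient 1': ['Olives', 'Pizza', 'Beef'],
    'IngOrd1': [1, 1, 17],
    'Ingredient 2': ['Bread', np.nan, 'Potato'],
    'IngOrd2': [4, 0, 6],
    'Ingredient 3': [np.nan, np.nan, 'Peas'],
    'IngOrd3': [0, 0, 19],
})

ingredients = ['Ingredient 1', 'Ingredient 2', 'Ingredient 3']
orders = ['IngOrd1', 'IngOrd2', 'IngOrd3']

for i, o in zip(ingredients, orders):
    # Nullable integer dtype, so a missing count does not force floats
    df[o] = df[o].astype('Int64')
    # Blank the count wherever the ingredient slot is empty
    df.loc[df[i].isna(), o] = pd.NA
    # Show empty ingredient slots as empty strings instead of NaN
    df[i] = df[i].fillna('')

print(df)
```

Masking on the ingredient column rather than on `count == 0` avoids wiping out a real ingredient whose orders genuinely sum to zero.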