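Check a string in a pandas column and modify another column

Time: 2019-10-01 11:03:01

Tags: python pandas

I am cleaning a DataFrame. It contains three columns: 'order_id', 'order_items', and 'order_type'. The order type can be Breakfast, Lunch, or Dinner. I want to check every item in an order to confirm that it matches the order type. If it does not, I want to remove the tuple that contains the wrong item.

The menus are as follows: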

breakfastMenu=['Pancake', 'Coffee', 'Eggs', 'Cereal']
dinnerMenu=['Salmon', 'Fish&Chips', 'Pasta', 'Shrimp']
lunchMenu=['Steak', 'Fries', 'Burger', 'Chicken', 'Salad']

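For example, in the first row the Lunch order contains Coffee, which is incorrect. Likewise, the Dinner order in the second row includes Eggs.

Sample DataFrame: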

    order_id   order_type  order_items
0   ORDB10489  Lunch       [('Coffee', 4), ('Salad', 10), ('Chicken', 8)]
1   ORDZ00319  Dinner      [('Fish&Chips', 9), ('Pasta', 5), ('Eggs', 3)]
2   ORDB00980  Dinner      [('Pasta', 6), ('Fish&Chips', 10)]
3   ORDY10003  Breakfast   [('Coffee', 2), ('Cereal', 1)]
4   ORDK04121  Lunch       [('Steak', 9), ('Chicken', 5)]

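I don't have much experience with pandas DataFrames. My idea is to write a for loop with if conditions. The loop would compare the first element of each tuple against order_type and the corresponding menu list. If the item is not in that list, the tuple is removed.

This draft code is only a start, but it is close to what I am trying to achieve: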

if dirtyData['order_type'].str.contains('Breakfast').any()\
        and eval(dirtyData['order_items'][0])[0][0] not in breakfastMenu:
            print(dirtyData['order_id']) 

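I added eval to convert the list of tuples from a string into a list.

Any input is appreciated. Thanks.

4 Answers:

Answer 0: (score: 2)

Use apply with a custom function.

For example: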

import ast

import pandas as pd

breakfastMenu=['Pancake', 'Coffee', 'Eggs', 'Cereal']
dinnerMenu=['Salmon', 'Fish&Chips', 'Pasta', 'Shrimp']
lunchMenu=['Steak', 'Fries', 'Burger', 'Chicken', 'Salad']

# Map each order type to the menu it is allowed to draw from
check_val = {'Breakfast': breakfastMenu, 'Dinner': dinnerMenu, "Lunch": lunchMenu}

data = [['ORDB10489', 'Lunch', "[('Coffee', 4), ('Salad', 10), ('Chicken', 8)]"],
 ['ORDZ00319', 'Dinner', "[('Fish&Chips', 9), ('Pasta', 5), ('Eggs', 3)]"],
 ['ORDB00980', 'Dinner', "[('Pasta', 6), ('Fish&Chips', 10)]"],
 ['ORDY10003', 'Breakfast', "[('Coffee', 2), ('Cereal', 1)]"],
 ['ORDK04121', 'Lunch', "[('Steak', 9), ('Chicken', 5)]"]]

df = pd.DataFrame(data, columns=['order_id', 'order_type', 'order_items'])
# Parse each string into a real list of (item, count) tuples
df["order_items"] = df["order_items"].apply(ast.literal_eval)
# Keep only the tuples whose item is on the menu for the row's order_type
df["order_items"] = df.apply(lambda x: [i for i in x["order_items"] if i[0] in check_val.get(x["order_type"], [])], axis=1)
print(df)

Output:

    order_id order_type                     order_items
0  ORDB10489      Lunch     [(Salad, 10), (Chicken, 8)]
1  ORDZ00319     Dinner   [(Fish&Chips, 9), (Pasta, 5)]
2  ORDB00980     Dinner  [(Pasta, 6), (Fish&Chips, 10)]
3  ORDY10003  Breakfast      [(Coffee, 2), (Cereal, 1)]
4  ORDK04121      Lunch      [(Steak, 9), (Chicken, 5)]
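
If the lambda gets hard to read, the same filter can be pulled out into a named helper. The following is only a sketch under a few assumptions: the names MENUS and keep_valid_items are made up for illustration, the menus are stored as sets for faster membership tests, and only two of the question's sample rows are used.

```python
import ast

import pandas as pd

# Menus copied from the question, stored as sets (an illustrative choice).
MENUS = {
    "Breakfast": {"Pancake", "Coffee", "Eggs", "Cereal"},
    "Lunch": {"Steak", "Fries", "Burger", "Chicken", "Salad"},
    "Dinner": {"Salmon", "Fish&Chips", "Pasta", "Shrimp"},
}


def keep_valid_items(order_type, order_items):
    """Return only the (item, count) tuples whose item is on the menu for order_type."""
    menu = MENUS.get(order_type, set())  # unknown order types keep nothing
    return [(item, count) for item, count in order_items if item in menu]


df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Dinner"],
    "order_items": [
        "[('Coffee', 4), ('Salad', 10), ('Chicken', 8)]",
        "[('Fish&Chips', 9), ('Pasta', 5), ('Eggs', 3)]",
    ],
})

# Parse the strings safely, then filter each row against its own menu.
df["order_items"] = df["order_items"].apply(ast.literal_eval)
df["order_items"] = pd.Series(
    [keep_valid_items(t, items) for t, items in zip(df["order_type"], df["order_items"])],
    index=df.index,
)
```

The helper takes plain values rather than a row, so it can be tested on its own without building a DataFrame.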

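Answer 1: (score: 1)

I think there is a solution that needs no for loops at all, only a few joins. But before we get there, the data has to be brought into a more suitable shape.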

flattened_items = df.order_items.apply(pd.Series).stack().reset_index().assign(
    **{"order_item": lambda x:x[0].str[0], "item_count": lambda x:x[0].str[1]})

print(flattened_items.head())
   level_0  level_1                0  order_item  item_count
0        0        0      (Coffee, 4)      Coffee           4
1        0        1      (Salad, 10)       Salad          10
2        0        2     (Chicken, 8)     Chicken           8
3        1        0  (Fish&Chips, 9)  Fish&Chips           9
4        1        1       (Pasta, 5)       Pasta           5

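Essentially, I have just split the list of tuples into two columns. Note that for this setup to work you may need to run reset_index on your original DataFrame df first (unless it already has a default integer index, like the example you posted).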
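
As an alternative to apply(pd.Series).stack(), the same flattening can probably be sketched with DataFrame.explode (available since pandas 0.25). This is a swapped-in technique, not what the answer above uses. The small DataFrame is built from two of the question's sample rows, and order_items is assumed to already hold lists of tuples rather than strings.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Dinner"],
    "order_items": [
        [("Coffee", 4), ("Salad", 10), ("Chicken", 8)],
        [("Fish&Chips", 9), ("Pasta", 5), ("Eggs", 3)],
    ],
})

# One row per (item, count) tuple; order_id and order_type are repeated.
exploded = df.explode("order_items")

# Split each tuple into two columns and drop the original tuple column.
flattened = (
    exploded.assign(
        order_item=exploded["order_items"].str[0],
        item_count=exploded["order_items"].str[1],
    )
    .drop(columns="order_items")
    .reset_index(drop=True)
)
```

Because explode keeps order_id and order_type on every row, this variant would not need the extra merge on level_0.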

Next, we create a DataFrame that maps each meal to its food items:

flattend_orders = pd.merge(df[["order_id", "order_type"]], 
         flattened_items[["level_0","order_item", "item_count"]],
left_index=True, right_on="level_0").drop("level_0", axis=1)

meal_dct = {"Breakfast": breakfastMenu, "Lunch": lunchMenu, "Dinner": dinnerMenu}

meal_df = pd.DataFrame.from_dict(meal_dct, orient="index").stack().reset_index(
).drop("level_1", axis=1).rename(columns={"level_0": "Meal", 0: "Item"})

which looks like

print(meal_df.head())
        Meal     Item
0  Breakfast  Pancake
1  Breakfast   Coffee
2  Breakfast     Eggs
3  Breakfast   Cereal
4      Lunch    Steak

Now we can do an inner join on order_type and order_item:

merged = pd.merge(flattend_orders, meal_df, left_on=["order_type", "order_item"],
right_on=["Meal", "Item"]).drop(["Meal", "Item"], axis=1)

We get

    order_id order_type  order_item  item_count
0  ORDB10489      Lunch       Salad          10
1  ORDB10489      Lunch     Chicken           8
2  ORDK04121      Lunch     Chicken           5
3  ORDZ00319     Dinner  Fish&Chips           9
4  ORDB00980     Dinner  Fish&Chips          10
5  ORDZ00319     Dinner       Pasta           5
6  ORDB00980     Dinner       Pasta           6
7  ORDY10003  Breakfast      Coffee           2
8  ORDY10003  Breakfast      Cereal           1
9  ORDK04121      Lunch       Steak           9

This may already be good enough, but you might prefer to get back a list of tuples. To do that:

merged.groupby(["order_id", "order_type"]).apply(lambda x: list(zip(x["order_item"], 
x["item_count"]))).reset_index().rename(columns={0:"order_items"})

which gives

    order_id order_type                     order_items
0  ORDB00980     Dinner  [(Fish&Chips, 10), (Pasta, 6)]
1  ORDB10489      Lunch     [(Salad, 10), (Chicken, 8)]
2  ORDK04121      Lunch      [(Chicken, 5), (Steak, 9)]
3  ORDY10003  Breakfast      [(Coffee, 2), (Cereal, 1)]
4  ORDZ00319     Dinner   [(Fish&Chips, 9), (Pasta, 5)]

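Note that the ugliness here comes from converting data that is (probably) in an unsuitable format. Likewise, all the for loops and apply calls come from the data transformation.

Basically, my answer boils down to: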

pd.merge(df, df_meal)

provided the data is already in the right format. By the way, the column name item_count was just my best guess.
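
To illustrate that last point, here is a minimal sketch of what "the right format" could look like: one row per ordered item, and a menu table whose column names match the order table. The frame names orders and meal_df and the column layout are assumptions made for this example.

```python
import pandas as pd

# Orders already in long format: one row per (order, item).
orders = pd.DataFrame({
    "order_id": ["ORDB10489"] * 3 + ["ORDZ00319"] * 3,
    "order_type": ["Lunch"] * 3 + ["Dinner"] * 3,
    "order_item": ["Coffee", "Salad", "Chicken", "Fish&Chips", "Pasta", "Eggs"],
    "item_count": [4, 10, 8, 9, 5, 3],
})

menus = {
    "Breakfast": ["Pancake", "Coffee", "Eggs", "Cereal"],
    "Lunch": ["Steak", "Fries", "Burger", "Chicken", "Salad"],
    "Dinner": ["Salmon", "Fish&Chips", "Pasta", "Shrimp"],
}

# Menu table with the same column names as the order table.
meal_df = pd.DataFrame(
    [(meal, item) for meal, items in menus.items() for item in items],
    columns=["order_type", "order_item"],
)

# With no `on` argument, merge joins on the shared columns
# (order_type and order_item); the default is an inner join,
# so rows whose item is not on the matching menu are dropped.
valid = pd.merge(orders, meal_df)
```

In this shape the whole cleaning step is a single join, which is the point the answer is making.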

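Answer 2: (score: 0)

This is probably something you want to do inside an apply function. Assuming breakfastMenu, dinnerMenu, and lunchMenu are defined at the top of your script, the following function will work: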

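def check_correct(x):
    # Compare case-insensitively: the data stores 'Lunch', 'Dinner', 'Breakfast'
    order_type = x['order_type'].lower()
    if order_type == 'lunch':
        current_menu = lunchMenu
    elif order_type == 'dinner':
        current_menu = dinnerMenu
    else:
        current_menu = breakfastMenu

    current_menu = [name.lower() for name in current_menu]

    # One boolean per (item, count) tuple: True if the item is on the menu.
    # Assumes order_items already holds a list of tuples, not a string.
    return_list = []

    for item, _ in x['order_items']:
        return_list.append(item.lower() in current_menu)

    return return_list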

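You can create a new column in the DataFrame with df.apply(check_correct, axis = 1). For each row it gives you a list of correct and incorrect matches. The first row yields the following output: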

[False, True, True]
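
The boolean lists only mark the wrong items; they do not remove them. One possible follow-up step, sketched here with hypothetical names (MENUS_LOWER, item_flags, order_items_clean) and two of the question's sample rows, is to zip the flags with the original tuples and keep the ones flagged True. order_items is assumed to already hold lists of tuples.

```python
import pandas as pd

# Lower-cased menus so the comparison is case-insensitive.
MENUS_LOWER = {
    "breakfast": {"pancake", "coffee", "eggs", "cereal"},
    "lunch": {"steak", "fries", "burger", "chicken", "salad"},
    "dinner": {"salmon", "fish&chips", "pasta", "shrimp"},
}


def item_flags(order_type, order_items):
    """Return one boolean per (item, count) tuple: True if the item fits the order type."""
    menu = MENUS_LOWER.get(order_type.lower(), set())
    return [item.lower() in menu for item, _ in order_items]


df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Dinner"],
    "order_items": [
        [("Coffee", 4), ("Salad", 10), ("Chicken", 8)],
        [("Fish&Chips", 9), ("Pasta", 5), ("Eggs", 3)],
    ],
})

flags = [item_flags(t, items) for t, items in zip(df["order_type"], df["order_items"])]

# Keep only the tuples whose flag is True.
df["order_items_clean"] = pd.Series(
    [[pair for pair, ok in zip(items, row_flags) if ok]
     for items, row_flags in zip(df["order_items"], flags)],
    index=df.index,
)
```

Keeping the flags as a separate step can be handy if you also want to report which orders contained wrong items before dropping them.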

Answer 3: (score: 0)

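new_column = []
for index, row in df.iterrows():
    new = []
    if row["order_type"] == "Breakfast":
        order_type = breakfastMenu
    elif row["order_type"] == "Dinner":
        order_type = dinnerMenu
    elif row["order_type"] == "Lunch":
        order_type = lunchMenu
    else:
        # Unknown order type: keep an empty list so the column stays aligned
        new_column.append(new)
        continue

    # order_items is still a string here; strip the outer brackets and split
    a = row["order_items"][1:-1]
    b = a.split(",")
    for i in range(0, len(b), 2):
        meal = b[i].strip()[2:-1]            # "('Coffee'" -> "Coffee"
        count = int(b[i + 1].strip(" )"))    # " 4)" -> 4
        if meal in order_type:
            new.append([meal, count])

    new_column.append(new)

# Assigning to row inside iterrows() does not write back to df,
# so the new column is built as a list and attached afterwards
df["order_items_new"] = new_column

print(df["order_items_new"])

0       [[Salad, 10], [Chicken, 8]]
1     [[Fish&Chips, 9], [Pasta, 5]]
2    [[Pasta, 6], [Fish&Chips, 10]]
3        [[Coffee, 2], [Cereal, 1]]
4        [[Steak, 9], [Chicken, 5]]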