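I am cleaning a DataFrame. It has three columns: 'order_id', 'order_type' and 'order_items'. The order type can be Breakfast, Lunch or Dinner. I want to compare every item in an order to confirm that it matches the order type. If it does not, I want to remove the tuple containing the wrong item.
The menus are as follows: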
breakfastMenu=['Pancake', 'Coffee', 'Eggs', 'Cereal']
dinnerMenu=['Salmon', 'Fish&Chips', 'Pasta', 'Shrimp']
lunchMenu=['Steak', 'Fries', 'Burger', 'Chicken', 'Salad']
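For example, you can see that the Lunch order in the first row contains Coffee, which is incorrect, and the Dinner order in the second row contains Eggs.
Sample of the DataFrame: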
order_id order_type order_items
0 ORDB10489 Lunch [('Coffee', 4), ('Salad', 10), ('Chicken', 8)]
1 ORDZ00319 Dinner [('Fish&Chips', 9), ('Pasta', 5), ('Eggs', 3)]
2 ORDB00980 Dinner [('Pasta', 6), ('Fish&Chips', 10)]
3 ORDY10003 Breakfast [('Coffee', 2), ('Cereal', 1)]
4 ORDK04121 Lunch [('Steak', 9), ('Chicken', 5)]
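I don't have much experience with pandas DataFrames, but my idea is to use a for loop with if conditions. The loop would compare the first element of each tuple against order_type and the corresponding menu list. If the item is not in the corresponding list, the tuple would be removed.
This draft code is only a start, but it is similar to what I want to achieve: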
if dirtyData['order_type'].str.contains('Breakfast').any()\
        and eval(dirtyData['order_items'][0])[0][0] not in breakfastMenu:
    print(dirtyData['order_id'])
I added eval to convert the list of tuples from a string into a list.
Any input is appreciated. Thanks.
Answer 0 (score: 2)
Use apply with a custom function.
For example:
import ast
import pandas as pd

breakfastMenu=['Pancake', 'Coffee', 'Eggs', 'Cereal']
dinnerMenu=['Salmon', 'Fish&Chips', 'Pasta', 'Shrimp']
lunchMenu=['Steak', 'Fries', 'Burger', 'Chicken', 'Salad']
# Map each order type to its menu
check_val = {'Breakfast': breakfastMenu, 'Dinner': dinnerMenu, "Lunch": lunchMenu}
data = [['ORDB10489', 'Lunch', "[('Coffee', 4), ('Salad', 10), ('Chicken', 8)]"],
        ['ORDZ00319', 'Dinner', "[('Fish&Chips', 9), ('Pasta', 5), ('Eggs', 3)]"],
        ['ORDB00980', 'Dinner', "[('Pasta', 6), ('Fish&Chips', 10)]"],
        ['ORDY10003', 'Breakfast', "[('Coffee', 2), ('Cereal', 1)]"],
        ['ORDK04121', 'Lunch', "[('Steak', 9), ('Chicken', 5)]"]]
df = pd.DataFrame(data, columns=['order_id', 'order_type', 'order_items'])
# Parse the string representation into a real list of tuples (safer than eval)
df["order_items"] = df["order_items"].apply(ast.literal_eval)
# Keep only the tuples whose item is on the menu for that row's order type
df["order_items"] = df.apply(lambda x: [i for i in x["order_items"] if i[0] in check_val.get(x["order_type"], [])], axis=1)
print(df)
Output:
order_id order_type order_items
0 ORDB10489 Lunch [(Salad, 10), (Chicken, 8)]
1 ORDZ00319 Dinner [(Fish&Chips, 9), (Pasta, 5)]
2 ORDB00980 Dinner [(Pasta, 6), (Fish&Chips, 10)]
3 ORDY10003 Breakfast [(Coffee, 2), (Cereal, 1)]
4 ORDK04121 Lunch [(Steak, 9), (Chicken, 5)]
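The same filter can also be written as a named function, which may be easier to read and test than the lambda. The sketch below is one possible variant, not part of the original answer: the helper name keep_valid_items and the set-based menus mapping are made up for illustration, and it assumes order_items has already been parsed into lists of tuples.

```python
import pandas as pd

# Menus stored as sets for fast membership tests (illustrative layout)
menus = {
    "Breakfast": {"Pancake", "Coffee", "Eggs", "Cereal"},
    "Lunch": {"Steak", "Fries", "Burger", "Chicken", "Salad"},
    "Dinner": {"Salmon", "Fish&Chips", "Pasta", "Shrimp"},
}

def keep_valid_items(row):
    """Return only the (item, count) tuples that are on the row's menu."""
    allowed = menus.get(row["order_type"], set())  # unknown type keeps nothing
    return [(item, count) for item, count in row["order_items"] if item in allowed]

# Small sample with already-parsed order_items
df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Dinner"],
    "order_items": [
        [("Coffee", 4), ("Salad", 10), ("Chicken", 8)],
        [("Fish&Chips", 9), ("Pasta", 5), ("Eggs", 3)],
    ],
})
df["order_items"] = df.apply(keep_valid_items, axis=1)
print(df)
```

An order type that is missing from menus ends up with an empty list here; whether that is the right behaviour depends on how you want to treat unknown types.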
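Answer 1 (score: 1)
I think there is a solution that does not need any for loops, only a few joins. But before we get there, we have to reshape the data into a more suitable form.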
flattened_items = df.order_items.apply(pd.Series).stack().reset_index().assign(
**{"order_item": lambda x:x[0].str[0], "item_count": lambda x:x[0].str[1]})
print(flattened_items.head())
level_0 level_1 0 order_item item_count
0 0 0 (Coffee, 4) Coffee 4
1 0 1 (Salad, 10) Salad 10
2 0 2 (Chicken, 8) Chicken 8
3 1 0 (Fish&Chips, 9) Fish&Chips 9
4 1 1 (Pasta, 5) Pasta 5
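Essentially, I just split the list of tuples into two columns. Note that for this setup to work you may need to run reset_index on the original DataFrame df (unless it already looks like the example you gave).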
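On pandas 0.25 or newer, DataFrame.explode may be a shorter way to reach the same long shape. This is an alternative sketch rather than the method used in this answer; it assumes order_items already holds parsed lists of tuples, and the small df and the name long_df are sample data and a made-up name.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Dinner"],
    "order_items": [
        [("Coffee", 4), ("Salad", 10)],
        [("Pasta", 5), ("Eggs", 3)],
    ],
})

# One row per (item, count) tuple; order_id and order_type are repeated
long_df = df.explode("order_items").reset_index(drop=True)

# Split each tuple into its two parts
long_df["order_item"] = long_df["order_items"].str[0]
long_df["item_count"] = long_df["order_items"].str[1].astype(int)
long_df = long_df.drop(columns="order_items")
print(long_df)
```

Because explode keeps the other columns next to each item, the separate merge on level_0 is not needed in this variant.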
Next, we join the items back to their orders and create a DataFrame that maps each meal to its food items:
flattend_orders = pd.merge(df[["order_id", "order_type"]],
flattened_items[["level_0","order_item", "item_count"]],
left_index=True, right_on="level_0").drop("level_0", axis=1)
meal_dct = {"Breakfast": breakfastMenu, "Lunch": lunchMenu, "Dinner": dinnerMenu}
meal_df = pd.DataFrame.from_dict(meal_dct, orient="index").stack().reset_index(
).drop("level_1", axis=1).rename(columns={"level_0": "Meal", 0: "Item"})
which looks like:
print(meal_df.head())
Meal Item
0 Breakfast Pancake
1 Breakfast Coffee
2 Breakfast Eggs
3 Breakfast Cereal
4 Lunch Steak
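For comparison, the same lookup table can probably be built more directly from a list of (meal, item) pairs. This is just an alternative sketch; the column names Meal and Item follow the answer, and the menus are copied from the question.

```python
import pandas as pd

meal_dct = {
    "Breakfast": ["Pancake", "Coffee", "Eggs", "Cereal"],
    "Lunch": ["Steak", "Fries", "Burger", "Chicken", "Salad"],
    "Dinner": ["Salmon", "Fish&Chips", "Pasta", "Shrimp"],
}

# One row per (meal, item) pair, without the stack/reset_index round trip
meal_df = pd.DataFrame(
    [(meal, item) for meal, items in meal_dct.items() for item in items],
    columns=["Meal", "Item"],
)
print(meal_df.head())
```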
Now we can do an inner join on order_type and order_item:
merged = pd.merge(flattend_orders, meal_df, left_on=["order_type", "order_item"],
right_on=["Meal", "Item"]).drop(["Meal", "Item"], axis=1)
and we get:
order_id order_type order_item item_count
0 ORDB10489 Lunch Salad 10
1 ORDB10489 Lunch Chicken 8
2 ORDK04121 Lunch Chicken 5
3 ORDZ00319 Dinner Fish&Chips 9
4 ORDB00980 Dinner Fish&Chips 10
5 ORDZ00319 Dinner Pasta 5
6 ORDB00980 Dinner Pasta 6
7 ORDY10003 Breakfast Coffee 2
8 ORDY10003 Breakfast Cereal 1
9 ORDK04121 Lunch Steak 9
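If you also want to see which items were rejected, a left join with indicator=True is one option. The sketch below uses small stand-in frames (orders and meal_df) with the same column names as above; the names checked and rejected are made up, and this is an illustration rather than part of the original answer.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDB10489", "ORDZ00319"],
    "order_type": ["Lunch", "Lunch", "Dinner"],
    "order_item": ["Coffee", "Salad", "Eggs"],
    "item_count": [4, 10, 3],
})
meal_df = pd.DataFrame({
    "Meal": ["Breakfast", "Breakfast", "Lunch", "Dinner"],
    "Item": ["Coffee", "Eggs", "Salad", "Pasta"],
})

# A left join keeps every ordered item; "_merge" records whether it matched
checked = pd.merge(orders, meal_df, how="left",
                   left_on=["order_type", "order_item"],
                   right_on=["Meal", "Item"], indicator=True)

# "both" means the item is on the right menu, "left_only" means it is not
rejected = checked.loc[checked["_merge"] == "left_only",
                       ["order_id", "order_type", "order_item", "item_count"]]
print(rejected)
```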
This may already be good enough, but you may prefer to get a list of tuples back. To do that:
merged.groupby(["order_id", "order_type"]).apply(lambda x: list(zip(x["order_item"],
x["item_count"]))).reset_index().rename(columns={0:"order_items"})
which gives:
order_id order_type order_items
0 ORDB00980 Dinner [(Fish&Chips, 10), (Pasta, 6)]
1 ORDB10489 Lunch [(Salad, 10), (Chicken, 8)]
2 ORDK04121 Lunch [(Chicken, 5), (Steak, 9)]
3 ORDY10003 Breakfast [(Coffee, 2), (Cereal, 1)]
4 ORDZ00319 Dinner [(Fish&Chips, 9), (Pasta, 5)]
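Note that the ugliness here comes from converting data that is (arguably) in an unsuitable format. Likewise, all the for loops and applies come from the data conversion.
Basically, my answer can be summarized as: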
pd.merge(df, df_meal)
if we assume the data is already in the right format.
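To illustrate what "the right format" could mean here, the sketch below assumes a hypothetical long layout with one row per ordered item and a df_meal table that uses the same column names. With matching names, pd.merge joins on the shared columns and defaults to an inner join, so only valid items survive.

```python
import pandas as pd

# Hypothetical tidy layout: one row per ordered item
df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDB10489", "ORDB00980"],
    "order_type": ["Lunch", "Lunch", "Dinner"],
    "order_item": ["Coffee", "Salad", "Pasta"],
    "item_count": [4, 10, 6],
})
# Menu table with the same column names as df (an assumption)
df_meal = pd.DataFrame({
    "order_type": ["Breakfast", "Lunch", "Dinner"],
    "order_item": ["Coffee", "Salad", "Pasta"],
})

# Inner join on the common columns order_type and order_item
clean = pd.merge(df, df_meal)
print(clean)
```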
By the way, I just chose item_count as a best guess for the column name.
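Answer 2 (score: 0)
This is probably something you want to do inside an apply function. Assuming breakfastMenu, dinnerMenu and lunchMenu are defined at the top of the script, the following function will work: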
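def check_correct(x):
    # Compare case-insensitively: the data uses 'Lunch', 'Dinner', 'Breakfast'
    order_type = x['order_type'].lower()
    if order_type == 'lunch':
        current_menu = lunchMenu
    elif order_type == 'dinner':
        current_menu = dinnerMenu
    else:
        current_menu = breakfastMenu
    current_menu = [name.lower() for name in current_menu]
    return_list = []
    # order_items must already be a list of (item, count) tuples
    for item, _ in x['order_items']:
        return_list.append(item.lower() in current_menu)
    return return_list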
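You can use this to create a new column in the DataFrame with df.apply(check_correct, axis=1). It gives you, for each row, a list of correct and incorrect matches. The first row will produce the following output: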
[False, True, True]
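The boolean lists can then be used to flag or clean rows. A minimal sketch, assuming order_items is already a list of tuples and using a hypothetical item_ok column that holds lists like the one above:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": ["ORDB10489", "ORDK04121"],
    "order_items": [
        [("Coffee", 4), ("Salad", 10), ("Chicken", 8)],
        [("Steak", 9), ("Chicken", 5)],
    ],
    # Hypothetical column holding the per-item results of the check
    "item_ok": [[False, True, True], [True, True]],
})

# Orders that contain at least one item from the wrong menu
has_error = [not all(flags) for flags in df["item_ok"]]
bad_orders = df.loc[has_error, "order_id"].tolist()

# Keep only the tuples whose flag is True
df["order_items"] = [
    [item for item, ok in zip(items, flags) if ok]
    for items, flags in zip(df["order_items"], df["item_ok"])
]
print(bad_orders)
print(df["order_items"].tolist())
```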
Answer 3 (score: 0)
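new_items = []  # one filtered list per row, assigned to df after the loop
for index, row in df.iterrows():
    new = []
    if row["order_type"] == "Breakfast":
        order_type = breakfastMenu
    elif row["order_type"] == "Dinner":
        order_type = dinnerMenu
    elif row["order_type"] == "Lunch":
        order_type = lunchMenu
    else:
        new_items.append(new)
        continue
    # order_items is still a string here, e.g. "[('Coffee', 4), ('Salad', 10)]"
    a = row["order_items"][1:-1]
    b = a.split(",")
    for i in range(0, len(b), 2):
        meal = b[i].strip()[2:-1]
        if meal in order_type:
            # strip the trailing ")" left over from the tuple before converting
            new.append([meal, int(b[i+1].strip(" )"))])
    new_items.append(new)
# Assigning to row inside the loop would not change df, so set the column here
df["order_items_new"] = new_items
print(df["order_items_new"])
0 [[Salad, 10], [Chicken, 8]]
1 [[Fish&Chips, 9], [Pasta, 5]]
2 [[Pasta, 6], [Fish&Chips, 10]]
3 [[Coffee, 2], [Cereal, 1]]
4 [[Steak, 9], [Chicken, 5]]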
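One thing to keep in mind with iterrows: each row is a separate Series object, so adding a new key to it inside the loop does not create a column on df. The sketch below shows this on made-up sample data, together with one pattern that does write the results back; the column name note is purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({"order_type": ["Lunch", "Dinner"]})

for index, row in df.iterrows():
    row["note"] = "seen"  # only changes the temporary row object

print("note" in df.columns)  # the DataFrame itself has no such column

# Collect one value per row, then assign the whole column at once
notes = []
for index, row in df.iterrows():
    notes.append(row["order_type"].lower())
df["note"] = notes
print(df)
```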