I am reading data from MongoDB and storing it in a pandas DataFrame for further exploratory analysis and machine learning. My MongoDB documents look like this:
{
"user_id" : "user_9",
"order_id" : "order_9",
"meals" : 5,
"order_area" : "London",
"dish" : [
{
"dish_id" : "012" ,
"dish_name" : "ABC",
"dish_type" : "Non-Veg",
"dish_price" : 135,
"dish_quantity" : 2,
"ratings" : 4,
"reviews" : "blah blah blah",
"coupon_type" : "Rs 20 off"
},
{
"dish_id" : "013" ,
"dish_name" : "XYZ",
"dish_type" : "Non-Veg",
"dish_price" : 125,
"dish_quantity" : 3,
"ratings" : 4,
"reviews" : "blah blah blah",
"coupon_type" : "Rs 20 off"
}
]
}
Once I have the data in Python, I use json_normalize to split out the dish-related attributes while loading it into a DataFrame:
df = json_normalize(db.dataset2.find(), 'dish',
                    ['_id', 'user_id', 'order_id', 'order_time', 'meals', 'order_area'])
This gives me the following pandas DataFrame:
coupon_type dish_id dish_name dish_price dish_quantity
0 Rs 20 off 012 ABC 135 2
1 Rs 20 off 013 XYZ 125 3
ratings reviews coupon_type user_id order_id meals order_area
0 4 blah blah blah Rs 20 off 9 9 5 London
1 4 blah blah blah Rs 20 off 9 9 5 London
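The problem is that the order-level data (user_id, order_id, meals, _id and order_area) is repeated on every dish row. Is there another way to store this data in a DataFrame without the duplication?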
Answer 0 (score: 1)
You are probably looking for a MultiIndex, which at least appears to avoid the duplication (see docs):
df = json_normalize(data, 'dish', ['user_id', 'order_id', 'meals', 'order_area'])
df = df.set_index(['user_id','order_id', 'meals', 'order_area'])
coupon_type dish_id dish_name dish_price \
user_id order_id meals order_area
user_9 order_9 5 London Rs 20 off 012 ABC 135
Rs 20 off 013 XYZ 125
dish_quantity dish_type ratings \
user_id order_id meals order_area
user_9 order_9 5 London 2 Non-Veg 4
3 Non-Veg 4
reviews
user_id order_id meals order_area
user_9 order_9 5 London blah blah blah
blah blah blah
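For anyone who wants to try this without a MongoDB connection, here is a minimal self-contained sketch of the same idea. It assumes pandas 1.0 or later (where json_normalize is available as pd.json_normalize), and the data list is only a stand-in for the result of db.dataset2.find(), built from the sample document in the question.

```python
import pandas as pd

# Stand-in for db.dataset2.find(): the sample order from the question.
data = [{
    "user_id": "user_9",
    "order_id": "order_9",
    "meals": 5,
    "order_area": "London",
    "dish": [
        {"dish_id": "012", "dish_name": "ABC", "dish_type": "Non-Veg",
         "dish_price": 135, "dish_quantity": 2, "ratings": 4,
         "reviews": "blah blah blah", "coupon_type": "Rs 20 off"},
        {"dish_id": "013", "dish_name": "XYZ", "dish_type": "Non-Veg",
         "dish_price": 125, "dish_quantity": 3, "ratings": 4,
         "reviews": "blah blah blah", "coupon_type": "Rs 20 off"},
    ],
}]

# Order-level fields that would otherwise repeat as ordinary columns.
meta = ["user_id", "order_id", "meals", "order_area"]

# One row per dish, with the order-level fields attached to each row.
df = pd.json_normalize(data, record_path="dish", meta=meta)

# Move the order-level fields into a MultiIndex.
df = df.set_index(meta)

print(df[["dish_id", "dish_name", "dish_price", "dish_quantity"]])

# The index levels can still be used for per-order aggregation.
print(df.groupby(level="order_id")["dish_price"].sum())
```

Note that the frame still has one row per dish. The MultiIndex displays repeated index values sparsely and keeps each distinct value once in its levels, so the order-level fields no longer appear as repeated ordinary columns.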