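I have this dataset:
The cuisine country values repeat across many rows, and what I want to output is a list of the 5 most popular ingredients for each country.
My code so far: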
import pandas as pd
from collections import Counter
filename="food.json"
food_dataset = pd.read_json(filename)
# getting separate columns
country = food_dataset.loc[:,"country"]
ingredients = food_dataset.loc[:,"ingredients"]
Counter = Counter(ingredients)
most_occur = Counter.most_common(3)
print(most_occur)
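Answer 0 (score: 1)
For pandas 0.25+, use DataFrame.explode, then GroupBy.apply with a lambda function that calls Series.value_counts and keeps the first N index values (N = 3 in the sample below; set N = 5 for the top 5):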
food_dataset = pd.DataFrame({'cuisine':['greek','southern_us'],
                             'ingredients':[list('andnsndnfndn'),
                                            list('ndnsndnfnsnd')]})
print (food_dataset)
cuisine ingredients
0 greek [a, n, d, n, s, n, d, n, f, n, d, n]
1 southern_us [n, d, n, s, n, d, n, f, n, s, n, d]
N = 3
df = (food_dataset.explode("ingredients")
                  .groupby('cuisine')['ingredients']
                  .apply(lambda x: x.value_counts().index[:N].tolist())
                  .reset_index())
print (df)
cuisine ingredients
0 greek [n, d, a]
1 southern_us [n, d, s]
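If the counts are needed next to the ingredient names, the same explode step can feed a size-based aggregation. This is only a sketch: the third `greek` row is made-up sample data added to show a repeated cuisine, and the `count` column name is my own choice.

```python
import pandas as pd

# Sample data; the second 'greek' row is hypothetical, added to show repeats.
food_dataset = pd.DataFrame({
    "cuisine": ["greek", "southern_us", "greek"],
    "ingredients": [list("andnsndnfndn"), list("ndnsndnfnsnd"), list("aaaf")],
})
N = 3

counts = (
    food_dataset.explode("ingredients")        # one row per (cuisine, ingredient)
    .groupby(["cuisine", "ingredients"])
    .size()                                    # occurrences per pair
    .rename("count")
    .reset_index()
    .sort_values(["cuisine", "count"], ascending=[True, False])
    .groupby("cuisine")
    .head(N)                                   # keep the N largest per cuisine
    .reset_index(drop=True)
)
print(counts)
```

Ingredients with equal counts may come out in either order, so treat the cut-off at N as arbitrary when there are ties.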
Alternative solution with Counter:
from collections import Counter  # the Counter class itself, not a variable named Counter

df = (food_dataset.explode("ingredients")
                  .groupby('cuisine')['ingredients']
                  .apply(lambda x: [y[0] for y in Counter(x).most_common(N)])
                  .reset_index())
print (df)
cuisine ingredients
0 greek [n, d, a]
1 southern_us [n, d, s]
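On a pandas version older than 0.25, where DataFrame.explode is not available, one option is to accumulate one Counter per cuisine in plain Python. This is a minimal sketch, not part of the approach above: the `totals` and `top` names are my own, and the second `greek` row is made-up sample data.

```python
from collections import Counter, defaultdict

import pandas as pd

# Sample data; the second 'greek' row is hypothetical, added to show repeats.
food_dataset = pd.DataFrame({
    "cuisine": ["greek", "southern_us", "greek"],
    "ingredients": [list("andnsndnfndn"), list("ndnsndnfnsnd"), list("aaaf")],
})
N = 3

# One Counter per cuisine, updated with each row's ingredient list.
totals = defaultdict(Counter)
for cuisine, ingredients in zip(food_dataset["cuisine"], food_dataset["ingredients"]):
    totals[cuisine].update(ingredients)

# Keep only the ingredient names of the N most common entries.
top = {
    cuisine: [name for name, _ in counter.most_common(N)]
    for cuisine, counter in totals.items()
}
print(top)
```

Because the counters are merged across rows, repeated cuisines are handled without any groupby.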
If every value in the cuisine column is unique, the solution is:
food_dataset['top'] = (food_dataset['ingredients']
                         .apply(lambda x: [y[0] for y in Counter(x).most_common(N)]))
print (food_dataset)
cuisine ingredients top
0 greek [a, n, d, n, s, n, d, n, f, n, d, n] [n, d, a]
1 southern_us [n, d, n, s, n, d, n, f, n, s, n, d] [n, d, s]
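To see why the uniqueness condition matters, here is a small sketch comparing the per-row result with the grouped result when a cuisine repeats. The two-row data is made up for illustration, and checking `Series.is_unique` first is my suggestion for choosing between the two approaches.

```python
from collections import Counter

import pandas as pd

# Hypothetical data: the same cuisine appears in two rows.
food_dataset = pd.DataFrame({
    "cuisine": ["greek", "greek"],
    "ingredients": [list("aab"), list("bbbc")],
})
N = 1

# False here, so the per-row shortcut does not give per-cuisine answers.
cuisine_is_unique = food_dataset["cuisine"].is_unique

# Per-row: each row is counted on its own.
per_row = (food_dataset["ingredients"]
           .apply(lambda x: [y[0] for y in Counter(x).most_common(N)]))

# Grouped: all rows of a cuisine are counted together (needs pandas 0.25+).
grouped = (food_dataset.explode("ingredients")
           .groupby("cuisine")["ingredients"]
           .apply(lambda x: [y[0] for y in Counter(x).most_common(N)]))

print(cuisine_is_unique)
print(per_row.tolist())
print(grouped["greek"])
```

The first row reports `a` as its own top ingredient, while across both greek rows `b` is the most frequent, so the per-row version is only safe when each cuisine appears once.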