我的数据集基于芝加哥市食品检验的结果。
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
以下是每个工具在数据集中出现的频率:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
我正在尝试创建一组新数据,其中包含其设施类型在数据集中出现超过50次的行。我该如何处理?
请注意,设施数量清单是最大的,因为我已经删除了大部分信息,因为它没有对手头的问题做出贡献(因此只需删除“特殊事件”,“共享厨房用户”的出现,以及“日托”不是我想要的。)
答案 0 :(得分:2)
IIUC那么你想要filter
:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
示例:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
答案 1 :(得分:0)
未经测试,但应该可以使用。
FT=df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT>50])]