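Data structure:
A pandas DataFrame (business_df) with a column of lists that I am interested in (categories)
A list containing restaurant categories (restaurant_categories_list)
What I want to do:
Filter the businesses in business_df by the categories column (whose values have a list structure), classifying a business as a restaurant if at least one of its listed categories matches at least one entry in restaurant_categories_list.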
I checked these two questions, but neither of them answers my problem:
Filter dataframe rows if value in column is in a set list of values
use a list of values to select rows from a pandas dataframe
This is the code I am currently running:
restaurant_categories_list = ['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']
print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])
Here is the column I am interested in:
0 ['Fast Food', 'Restaurants']
1 ['Nightlife']
2 ['Auto Repair', 'Automotive']
3 ['Active Life', 'Mini Golf', 'Golf']
4 ['Shopping', 'Home Services', 'Internet Servic...
5 ['Bars', 'American (New)', 'Nightlife', 'Loung...
6 ['Active Life', 'Trainers', 'Fitness & Instruc...
7 ['Bars', 'American (Traditional)', 'Nightlife'...
8 ['Auto Repair', 'Automotive', 'Tires']
9 ['Active Life', 'Mini Golf']
10 ['Home Services', 'Contractors']
11 ['Veterinarians', 'Pets']
12 ['Libraries', 'Public Services & Government']
13 ['Automotive', 'Auto Parts & Supplies']
14 ['Burgers', 'Breakfast & Brunch', 'American (T...
So, if I used only these rows, my expected DataFrame should contain only rows 0 and 14.
Answer 0 (score: 1)
UPDATE:
This version uses ast.literal_eval() to deserialize the lists from their string representation, and it seems to work properly:
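import ast
import pandas as pd
restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants', 'Burgers', 'Breakfast & Brunch']
df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)
# Drop rows without categories; ast.literal_eval() cannot parse NaN
df = df_orig[pd.notnull(df_orig['categories'])]
# Parse each string into a list, spread the items over columns, flag rows with at least one match
mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
# Show the categories column of the matching rows (.loc replaces the removed .ix indexer)
print(df.loc[mask, ['categories']])
df[mask].to_csv('result.csv', index=False)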
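For anyone without the dataset at hand, here is a minimal sketch of the same idea on an in-memory CSV. The `csv_text` sample and its `name` column are made up for illustration and are not taken from the Yelp file; the assumption is only that the `categories` column stores each list as text. It uses `.any(axis=1)`, which should be equivalent to `.sum(axis=1) > 0` on a boolean frame.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: each list is stored as a string.
csv_text = '''name,categories
A,"['Fast Food', 'Restaurants']"
B,"['Nightlife']"
C,
D,"['Burgers', 'Breakfast & Brunch']"
'''

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

df = pd.read_csv(io.StringIO(csv_text))

# Row C has no categories (NaN), so drop it before parsing.
df = df[df['categories'].notnull()]

# Turn each string such as "['Nightlife']" into a real Python list.
parsed = df['categories'].apply(ast.literal_eval)

# One column per list item, then flag rows where any item is a restaurant category.
mask = parsed.apply(pd.Series).isin(restaurant_categories_list).any(axis=1)

result = df.loc[mask]
print(result)
```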
But as @CorleyBrigman has already said, working with this kind of data structure in pandas is very difficult and very inefficient...
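One possible way to reduce that overhead, offered as a sketch rather than a measured improvement, is to skip the wide intermediate DataFrame and test each list against a set directly. The helper name `is_restaurant` and the three-row sample frame are hypothetical, and the column is assumed to already hold real Python lists.

```python
import pandas as pd

# A set gives fast membership tests; the contents match the list in the question.
restaurant_categories = {'Soup', 'Sandwiches', 'Salad', 'Restaurants',
                         'Burgers', 'Breakfast & Brunch'}

# Small illustrative frame; the column is assumed to hold real Python lists.
business_df = pd.DataFrame({'categories': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Burgers', 'Breakfast & Brunch'],
]})


def is_restaurant(cats):
    """Return True if at least one category is a restaurant category."""
    return not restaurant_categories.isdisjoint(cats)


mask = business_df['categories'].apply(is_restaurant)
restaurants = business_df[mask]
print(restaurants)
```

This still calls a Python function once per row, so it is not vectorized, but it avoids creating one column per list position.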
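Old answer, based on the sample data:
You can convert the lists into columns (one Series per row) and then use the DataFrame.isin() method to produce a matrix of True/False values, which can be summed row-wise (because in Python False == 0 and True == 1):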
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])
Explanation:
print(df['business'].apply(pd.Series))
0 1 2 3
0 Fast Food Restaurants NaN NaN
1 Nightlife NaN NaN NaN
2 Auto Repair Automotive NaN NaN
3 Active Life Mini Golf Golf NaN
4 Shopping Home Services Internet Servic NaN
5 Bars American (New) Nightlife Loung
6 Active Life Trainers Fitness & Instruc NaN
7 Bars American (Traditional) Nightlife NaN
8 Auto Repair Automotive Tires NaN
9 Active Life Mini Golf NaN NaN
10 Home Services Contractors NaN NaN
11 Veterinarians Pets NaN NaN
12 Libraries Public Services & Government NaN NaN
13 Automotive Auto Parts & Supplies NaN NaN
14 Burgers Breakfast & Brunch American NaN
Then:
print(df['business'].apply(pd.Series).isin(restaurant_categories_list))
Output:
0 1 2 3
0 False True False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
5 False False False False
6 False False False False
7 False False False False
8 False False False False
9 False False False False
10 False False False False
11 False False False False
12 False False False False
13 False False False False
14 True True False False
Then:
mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)
Output:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 True
dtype: bool
Finally:
print(df[(mask)])
Output:
business
0 [Fast Food, Restaurants]
14 [Burgers, Breakfast & Brunch, American]
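To try the steps above end to end, the following sketch rebuilds a few rows of the sample data and runs the same pipeline. Only rows 0 to 3 and row 14 are included; rows that appear truncated in the printed sample are left out. The intermediate names `expanded` and `matches` are introduced here for readability.

```python
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

# A subset of the sample rows, keeping their original index labels.
df = pd.DataFrame(
    {'business': [
        ['Fast Food', 'Restaurants'],
        ['Nightlife'],
        ['Auto Repair', 'Automotive'],
        ['Active Life', 'Mini Golf', 'Golf'],
        ['Burgers', 'Breakfast & Brunch', 'American'],
    ]},
    index=[0, 1, 2, 3, 14],
)

# Step 1: spread each list over columns (shorter lists are padded with NaN).
expanded = df['business'].apply(pd.Series)

# Step 2: True wherever a cell is one of the restaurant categories.
matches = expanded.isin(restaurant_categories_list)

# Step 3: a row qualifies if it has at least one True (True counts as 1).
mask = matches.sum(axis=1) > 0

# Step 4: keep only the qualifying rows.
print(df[mask])
```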