Filter DataFrame rows whose values are lists

Time: 2016-03-13 14:15:29

Tags: python pandas

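Data structures:

  • A pandas DataFrame (business_df) with a column of lists that I'm interested in (categories)

  • A list of restaurant categories (restaurant_categories_list)

What I want to do:

Filter the businesses in business_df by the categories column (whose values are lists), treating a business as a restaurant if at least one of its listed categories matches at least one entry in restaurant_categories_list.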

I checked these 2 questions, but they did not answer my problem:

Filter dataframe rows if value in column is in a set list of values

use a list of values to select rows from a pandas dataframe

This is the code I'm currently running:

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants', 'Burgers', 'Breakfast & Brunch']
print(business_df.loc[business_df['categories'].isin(restaurant_categories_list)])

Here is the column I'm interested in:

0                          ['Fast Food', 'Restaurants']
1                                         ['Nightlife']
2                         ['Auto Repair', 'Automotive']
3                  ['Active Life', 'Mini Golf', 'Golf']
4     ['Shopping', 'Home Services', 'Internet Servic...
5     ['Bars', 'American (New)', 'Nightlife', 'Loung...
6     ['Active Life', 'Trainers', 'Fitness & Instruc...
7     ['Bars', 'American (Traditional)', 'Nightlife'...
8                ['Auto Repair', 'Automotive', 'Tires']
9                          ['Active Life', 'Mini Golf']
10                     ['Home Services', 'Contractors']
11                            ['Veterinarians', 'Pets']
12        ['Libraries', 'Public Services & Government']
13              ['Automotive', 'Auto Parts & Supplies']
14    ['Burgers', 'Breakfast & Brunch', 'American (T...

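So if I used only these rows, my expected DataFrame should contain only rows 0 and 14.

1 answer:

Answer 0 (score: 1)

UPDATE:

This version uses ast.literal_eval() to deserialize the lists from their string form, and it seems to work correctly: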

import ast
import pandas as pd

restaurant_categories_list=['Soup','Sandwiches','Salad', 'Restaurants','Burgers', 'Breakfast & Brunch']

df_orig = pd.read_csv('yelp_academic_dataset_business.csv', low_memory=False)

df = df_orig[(pd.notnull(df_orig['categories']))]

mask = df['categories'].apply(ast.literal_eval).apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0

print(df.loc[mask, ['categories']])  # .loc replaces .ix, which was removed in pandas 1.0
df[mask].to_csv('result.csv', index=False)
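
To make the role of ast.literal_eval() in the code above more concrete, here is a minimal sketch of what it does to a single cell. It assumes the CSV stores each category list as a string that looks like a Python list; raw_cell is a made-up example value, not a line read from the actual file.

```python
import ast

# A cell as it would arrive from the CSV: a string that only looks like a list.
raw_cell = "['Fast Food', 'Restaurants']"

# literal_eval parses Python literals (lists, strings, numbers, ...) without
# executing arbitrary code, which makes it safer than eval() for this job.
parsed = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__)
print(type(parsed).__name__)
print(parsed)
```

Calling isin() on the raw strings compares each whole string against the category names, so nothing matches until the strings have been parsed into real lists.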

But as @CorleyBrigman already said, working with a data structure like this in pandas is very difficult and very inefficient...
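
One possible way to reduce that overhead, which is not part of the original answer, is to skip the wide intermediate DataFrame and test each list directly against a set. The sketch below assumes the column already holds real Python lists; the small DataFrame is illustrative, and its last row is a shortened stand-in for row 14 of the sample.

```python
import pandas as pd

# Using a set makes each membership test cheap.
restaurant_categories = {'Soup', 'Sandwiches', 'Salad', 'Restaurants',
                         'Burgers', 'Breakfast & Brunch'}

# Illustrative data only (assumption): a few rows shaped like the sample.
df = pd.DataFrame({'categories': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Auto Repair', 'Automotive'],
    ['Burgers', 'Breakfast & Brunch'],
]})

# A row is kept when its list shares at least one element with the set.
mask = df['categories'].apply(
    lambda cats: not restaurant_categories.isdisjoint(cats)
)

result = df[mask]
print(result)
```

This still runs a Python-level function per row, so it is not vectorized, but it avoids allocating one column per list position.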

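Old answer, based on the sample data:

You can expand the lists into columns (one Series per row) and then use the isin() method to produce a matrix of True/False values, which can be summed row by row (because in Python False == 0 and True == 1):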

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(df[(mask)])

Explanation:

print(df['business'].apply(pd.Series))

                0                             1                  2      3
0       Fast Food                   Restaurants                NaN    NaN
1       Nightlife                           NaN                NaN    NaN
2     Auto Repair                    Automotive                NaN    NaN
3     Active Life                     Mini Golf               Golf    NaN
4        Shopping                 Home Services    Internet Servic    NaN
5            Bars                American (New)          Nightlife  Loung
6     Active Life                      Trainers  Fitness & Instruc    NaN
7            Bars        American (Traditional)          Nightlife    NaN
8     Auto Repair                    Automotive              Tires    NaN
9     Active Life                     Mini Golf                NaN    NaN
10  Home Services                   Contractors                NaN    NaN
11  Veterinarians                          Pets                NaN    NaN
12      Libraries  Public Services & Government                NaN    NaN
13     Automotive         Auto Parts & Supplies                NaN    NaN
14        Burgers            Breakfast & Brunch           American    NaN

Then:

print(df['business'].apply(pd.Series).isin(restaurant_categories_list))

Output:

        0      1      2      3
0   False   True  False  False
1   False  False  False  False
2   False  False  False  False
3   False  False  False  False
4   False  False  False  False
5   False  False  False  False
6   False  False  False  False
7   False  False  False  False
8   False  False  False  False
9   False  False  False  False
10  False  False  False  False
11  False  False  False  False
12  False  False  False  False
13  False  False  False  False
14   True   True  False  False

Then:

mask = df['business'].apply(pd.Series).isin(restaurant_categories_list).sum(axis=1) > 0
print(mask)

Output:

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14     True
dtype: bool

Finally:

print(df[(mask)])

Output:

                                   business
0                  [Fast Food, Restaurants]
14  [Burgers, Breakfast & Brunch, American]
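
For anyone who wants to reproduce the steps above without the Yelp file, the following is a self-contained sketch. It assumes a DataFrame with a 'business' column holding real lists, and it uses only the first four sample rows, so only row 0 is expected to match.

```python
import pandas as pd

restaurant_categories_list = ['Soup', 'Sandwiches', 'Salad', 'Restaurants',
                              'Burgers', 'Breakfast & Brunch']

# First four rows of the sample column (assumption: stored as real lists).
df = pd.DataFrame({'business': [
    ['Fast Food', 'Restaurants'],
    ['Nightlife'],
    ['Auto Repair', 'Automotive'],
    ['Active Life', 'Mini Golf', 'Golf'],
]})

# Step 1: expand each list into its own row of columns; short lists get NaN.
expanded = df['business'].apply(pd.Series)

# Step 2: mark every cell that holds one of the restaurant categories.
matches = expanded.isin(restaurant_categories_list)

# Step 3: a row qualifies if it has at least one True cell.
mask = matches.sum(axis=1) > 0

print(df[mask])
```

Writing matches.any(axis=1) in step 3 would give the same mask and states the intent a little more directly.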