Question

I am trying to combine OR (|) with df.loc to extract data, but the code I wrote returns everything in the CSV file. Here is the original CSV file: https://drive.google.com/open?id=16eo29mF0pn_qNw-BGpZyVM9PBxv2aN1G

import pandas as pd

df = pd.read_csv("yelp_business.csv")
df = df.loc[(df['categories'].str.contains('chinese', case = False)) | (df['name'].str.contains('subway', case = False)) | (df['categories'].str.contains('', case = False)) | (df['address'].str.contains('', case = False))]

print(df)

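It seems that either the empty quotes '' do not work in str.contains, or the OR operator | does not work in df.loc. Instead of returning only the rows for Chinese restaurants (4,171 rows) and the rows for restaurants named subway, it returns all 174,568 rows.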

Edited

Given that the address may have no assigned value or may be empty, the output I want is all rows with category chinese and all rows with name subway.

import pandas as pd

df = pd.read_csv("yelp_business.csv")

cusine = 'chinese'
name = 'subway'
address #address has no assigned value or is NULL

df = df.loc[(df['categories'].str.contains(cusine, case = False)) |
            (df['name'].str.contains(name, case = False)) | 
            (df['address'].str.contains(address, case = False))]


print(df)

This code gives me the error NameError: name 'address' is not defined.

Answer 1

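I think you need to chain the conditions with | for the categories column. To find empty strings, use ^""$, which matches the start and the end of a string that consists only of the two quote characters: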

df = pd.read_csv("yelp_business.csv")

df1 = df.loc[(df['categories'].str.contains('chinese|^""$', case = False)) |
            (df['name'].str.contains('subway', case = False)) | 
            (df['address'].str.contains('^""$', case = False))]
print (len(df1))
11320

print (df1.head())

               business_id                     name neighborhood  \
9   TGWhGNusxyMaA4kQVBNeew  "Detailing Gone Mobile"          NaN   
53  4srfPk1s8nlm1YusyDUbjg              ***"Subway"    Southeast   
57  spDZkD6cp0JUUm6ghIWHzA              "Kitchen M"   Unionville   
63  r6Jw8oRCeumxu7Y1WRxT7A           "D&D Cleaning"          NaN   
88  YhV93k9uiMdr3FlV4FHjwA        "Caviness Studio"          NaN   

                          address       city state postal_code   latitude  \
9                           ***""  Henderson    NV       89014  36.055825   
53  "6889 S Eastern Ave, Ste 101"  Las Vegas    NV       89119  36.064652   
57            "8515 McCowan Road"    Markham    ON     L3P 5E5  43.867918   
63                          ***""     Urbana    IL       61802  40.110588   
88                          ***""    Phoenix    AZ       85001  33.449967   

     longitude  stars  review_count  is_open  \
9  -115.046350    5.0             7        1   
53 -115.118954    2.5             6        1   
57  -79.283687    3.0            80        1   
63  -88.207270    5.0             4        0   
88 -112.070223    5.0             4        1   

                                           categories  
9                           Automotive;Auto Detailing  
53                   Fast Food;Restaurants;Sandwiches  
57                             ***Restaurants;Chinese  
63         Home Cleaning;Home Services;Window Washing  
88  Marketing;Men's Clothing;Restaurants;Graphic D...
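
To illustrate why the original query returned every row, here is a minimal sketch on a made-up Series (the values are hypothetical, not taken from the Yelp file). It assumes, as the output above suggests, that empty fields are stored as the literal two-character string "". An empty pattern matches at the start of every string, so str.contains('') is True for every non-null value, whereas the anchored pattern or a plain equality test selects only the empty entries.

```python
import pandas as pd

# Toy data (hypothetical); the second value is the literal two-character string '""'
s = pd.Series(['"12 Main St"', '""', '"5 Oak Ave"'])

# An empty pattern matches at position 0 of every string, so every row is True
all_match = s.str.contains('').tolist()
print(all_match)    # [True, True, True]

# Anchored pattern: only strings that consist of exactly two double quotes
anchored = s.str.contains('^""$').tolist()
print(anchored)     # [False, True, False]

# Plain equality selects the same rows here without using a regex
equal = (s == '""').tolist()
print(equal)        # [False, True, False]
```

When the goal is only to detect the empty marker, the equality test is probably the simpler choice, since it does not depend on regex escaping.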

EDIT: If you need to filter out empty and NaN values:

# & binds tighter than |, so group the OR conditions before excluding empty values
df2 = df.loc[((df['categories'].str.contains('chinese', case = False)) |
              (df['name'].str.contains('subway', case = False))) &
            ~((df['address'] == '""') | (df['categories'] == '""'))]

print (df2.head())
                business_id              name     neighborhood  \
53   4srfPk1s8nlm1YusyDUbjg          "Subway"        Southeast   
57   spDZkD6cp0JUUm6ghIWHzA       "Kitchen M"       Unionville   
96   dTWfATVrBfKj7Vdn0qWVWg  "Flavor Cuisine"      Scarborough   
126  WUiDaFQRZ8wKYGLvmjFjAw    "China Buffet"  University City   
145  vzx1WdVivFsaN4QYrez2rw          "Subway"              NaN   

                                 address       city state postal_code  \
53         "6889 S Eastern Ave, Ste 101"  Las Vegas    NV       89119   
57                   "8515 McCowan Road"    Markham    ON     L3P 5E5   
96                "8 Glen Watford Drive"    Toronto    ON     M1S 2C1   
126  "8630 University Executive Park Dr"  Charlotte    NC       28262   
145                   "5111 Boulder Hwy"  Las Vegas    NV       89122   

      latitude   longitude  stars  review_count  is_open  \
53   36.064652 -115.118954    2.5             6        1   
57   43.867918  -79.283687    3.0            80        1   
96   43.787061  -79.276166    3.0             6        1   
126  35.306173  -80.752672    3.5            76        1   
145  36.112895 -115.062353    3.0             3        1   

                                 categories  
53         Fast Food;Restaurants;Sandwiches  
57                      Restaurants;Chinese  
96           Restaurants;Chinese;Food Court  
126  Buffets;Restaurants;Sushi Bars;Chinese  
145        Sandwiches;Restaurants;Fast Food
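
In boolean masks that mix | and &, Python evaluates & first, so how the conditions are grouped changes which rows survive. The sketch below shows the difference on a small made-up DataFrame (all values are hypothetical). It also passes na=False to str.contains so that missing values count as non-matches. Whether the real file contains NaN in these columns is an assumption.

```python
import pandas as pd

# Hypothetical rows; '""' stands for an empty address as in the CSV above
df = pd.DataFrame({
    'name':       ['"Subway"', '"Kitchen M"', '"Wok Inn"', '"Subway"'],
    'address':    ['"1 A St"', '"2 B Rd"', '""', '""'],
    'categories': ['Fast Food', 'Restaurants;Chinese', 'Chinese', 'Sandwiches'],
})

# na=False makes missing values count as "no match" instead of NaN
is_chinese = df['categories'].str.contains('chinese', case=False, na=False)
is_subway = df['name'].str.contains('subway', case=False, na=False)
has_address = df['address'].notna() & (df['address'] != '""')

# Without parentheses this is: is_chinese | (is_subway & has_address)
loose = df[is_chinese | is_subway & has_address]

# With parentheses the address check applies to both conditions
strict = df[(is_chinese | is_subway) & has_address]

print(loose.index.tolist())   # [0, 1, 2]
print(strict.index.tolist())  # [0, 1]
```

Naming each condition before combining them also tends to make the grouping easier to check by eye.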

Answer 2

You can find details about contains here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.contains.html
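
For the edited question, where a search term such as address may be unset, one possible approach is to build the mask only from the terms that actually have a value. The helper below is a hypothetical sketch, not part of pandas: build_mask and the terms dictionary are names introduced here for illustration, and the sample rows are made up. It relies on the case, na, and regex parameters of Series.str.contains.

```python
import pandas as pd

def build_mask(df, terms):
    """OR together case-insensitive substring matches, skipping unset terms."""
    mask = pd.Series(False, index=df.index)
    for column, term in terms.items():
        if term:  # skip None and '' so an empty pattern cannot match every row
            mask |= df[column].str.contains(term, case=False, na=False, regex=False)
    return mask

# Hypothetical sample rows
df = pd.DataFrame({
    'name':       ['"Subway"', '"Kitchen M"', '"D&D Cleaning"'],
    'address':    ['"5111 Boulder Hwy"', '"8515 McCowan Road"', None],
    'categories': ['Fast Food;Sandwiches', 'Restaurants;Chinese', 'Home Cleaning'],
})

# address is deliberately left unset, as in the question
terms = {'categories': 'chinese', 'name': 'subway', 'address': None}
result = df[build_mask(df, terms)]
print(result.index.tolist())  # [0, 1]
```

Defining address as None, rather than leaving the name undefined, avoids the NameError, and skipping falsy terms avoids the match-everything behavior of an empty pattern.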
