Boolean indexing in pandas with the "in" operator?

Time: 2020-06-09 22:04:27

Tags: python pandas

I have a pandas DataFrame like this:

import pandas as pd

# Named "data" rather than "dict" to avoid shadowing the built-in
data = {'plan_id': ["4H", "40", "HA", "H5", "5B"],
        'planproduct': ["4H - MMP", "40 - STAR", "9H - STAR+PLUS", "HA - MMP", "C4 - STAR+PLUS"],
        'juliandat': ['114', '157', '149', '142', '150']}

df = pd.DataFrame(data, index=[1, 2, 3, 4, 5])

Say I have some lists, for example:

starplus_id = ['47', '9H', 'H5', '5B', 'C4']
mmp_pp = ['4H - MMP', 'HA - MMP', '9K - MMP']
mmp_id = ['4H','HA','9K']
starplus_pp = ['47 - STAR+PLUS', '9H - STAR+PLUS', 'H5 - STAR+PLUS', '5B - STAR+PLUS', 'C4 - STAR+PLUS']

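I want to filter out rows so that if the plan_id value is one of the starplus_id values, then the planproduct field cannot be an mmp_id value, and vice versa. If planproduct is one of the starplus_pp values, then plan_id cannot be one of the mmp_id values, and vice versa. It is also fine if plan_id is not in starplus_id at all. (I put column names in code formatting and list names in italics.)

I don't know how to do this. I tried using the in operator, for example: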

df = final[((df['plan_id'] in starplus_id) & (df['planproduct'] not in mmp_pp)) & 
       ((df['plan_id'] in mmp_id) & (df['planproduct'] not in starplus_pp)) &
      ((df['planproduct'] in starplus_pp) & (df['plan_id'] not in mmp_id)) &
       ((df['planproduct'] in mmp_pp) & (df['plan_id'] not in starplus_id)) |
       (df['plan_id'] not in starplus_pp)
      ]

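But I get

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This is more complex boolean indexing than I have attempted in pandas before, and I am not sure how to do it. The result should look like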

    plan_id planproduct     juliandate
1   4H      4H - MMP         114
2   40      40 - STAR        157
5   5B      C4 - STAR+PLUS   150

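1 Answer:

Answer 0: (score: 1)

Here is my attempt. I modified starplus_pp to get rid of whitespace, + and -, because str.contains has trouble matching those characters (it treats the pattern as a regular expression by default, so + is not matched literally). That requires creating temporary columns, which the final iloc accessor drops again.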
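As a side note on why `+` causes trouble: in a regular expression it is a quantifier, so an unescaped `STAR+PLUS` does not match the literal text "STAR+PLUS". A minimal sketch of an alternative, assuming you would rather keep the original list values unmodified, is to escape each value with `re.escape` before joining. The names `unescaped`, `escaped` and `pattern` are introduced here for illustration only.

```python
import re
import pandas as pd

planproduct = pd.Series(["4H - MMP", "9H - STAR+PLUS", "C4 - STAR+PLUS"])
starplus_pp = ['47 - STAR+PLUS', '9H - STAR+PLUS', 'H5 - STAR+PLUS',
               '5B - STAR+PLUS', 'C4 - STAR+PLUS']

# Unescaped: "R+" means "one or more R", so the literal "+" never matches
unescaped = planproduct.str.contains('|'.join(starplus_pp))

# Escaped: re.escape turns every regex metacharacter into a literal
pattern = '|'.join(re.escape(p) for p in starplus_pp)
escaped = planproduct.str.contains(pattern)

print(unescaped.tolist())
print(escaped.tolist())
```

With the escaped pattern there is no need for a cleaned-up copy of the column, although `str.contains` still does substring matching rather than exact matching.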

Temporary columns

# Strip spaces, "+" and "-" so the values match the modified list below
df['planproducts'] = df['planproduct'].str.replace(r'[-+\s]', '', regex=True)
# Extract the leading code before the "-" in planproduct
df['planproductsz'] = df['planproduct'].str.split('-').str[0].str.strip()

Modified lists

starplus_id = ['47', '9H', 'H5', '5B', 'C4']
mmp_pp = ['4H - MMP', 'HA - MMP', '9K - MMP']
mmp_id = ['4H','HA','9K']
starplus_pp = ['47STARPLUS', '9HSTARPLUS', 'H5STARPLUS', '5BSTARPLUS', 'C4STARPLUS']#Modified list

Build the pattern strings with .join

sid='|'.join(starplus_id)
mp='|'.join(mmp_pp)
sp='|'.join(starplus_pp)
mid='|'.join(mmp_id)

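Query

# Drop rows whose plan_id is a STAR+PLUS id while planproduct starts with an MMP id
df2 = df[~(df.plan_id.str.contains(sid) & df.planproductsz.str.contains(mid))]
# Drop rows whose planproduct is a STAR+PLUS product while plan_id is an MMP id,
# then keep only the three original columns
df2[~(df2.planproducts.str.contains(sp) & df2.plan_id.str.contains(mid))].iloc[:, :3]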

    plan_id planproduct     juliandat
1   4H      4H - MMP         114
2   40      40 - STAR        157
5   5B      C4 - STAR+PLUS   150
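For exact matches against a list, `Series.isin` is the element-wise counterpart of Python's `in`, and its result combines with `&`, `|` and `~`. Plain `in` fails because Python has to reduce the comparison against a whole Series to a single True or False, which is what raises the ValueError. Below is a sketch of the same filter without temporary columns. It assumes the rule is "drop a row when its plan_id belongs to one group and its planproduct to the other", and it compares planproduct against mmp_pp rather than mmp_id. The names `bad` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'plan_id': ["4H", "40", "HA", "H5", "5B"],
     'planproduct': ["4H - MMP", "40 - STAR", "9H - STAR+PLUS",
                     "HA - MMP", "C4 - STAR+PLUS"],
     'juliandat': ['114', '157', '149', '142', '150']},
    index=[1, 2, 3, 4, 5],
)

starplus_id = ['47', '9H', 'H5', '5B', 'C4']
mmp_pp = ['4H - MMP', 'HA - MMP', '9K - MMP']
mmp_id = ['4H', 'HA', '9K']
starplus_pp = ['47 - STAR+PLUS', '9H - STAR+PLUS', 'H5 - STAR+PLUS',
               '5B - STAR+PLUS', 'C4 - STAR+PLUS']

# A row conflicts when plan_id is a STAR+PLUS id but planproduct is an MMP
# product, or when plan_id is an MMP id but planproduct is a STAR+PLUS product
bad = (
    (df['plan_id'].isin(starplus_id) & df['planproduct'].isin(mmp_pp))
    | (df['plan_id'].isin(mmp_id) & df['planproduct'].isin(starplus_pp))
)

# Keep everything that does not conflict
result = df[~bad]
print(result)
```

On the sample data this keeps the rows with index 1, 2 and 5. Each comparison needs its own parentheses because `&` and `|` bind more tightly than comparison operators in Python.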