Question

I have two dataframes and am trying to use one as a filter for the other. This is what the two dataframes look like:

df #(filtering)
   phrase1   date1
0  cat       2012-03-04    
1  tree      2015-05-02    
2  snail     2002-08-27 
3  dog       2004-02-27 

df1 #(being filtered)
   id       phrase2        date2
0  abc12    cat nip        2003-03-04  
1  def34    baobab tree    2009-05-02    
2  ghi56    lazy dog       2011-08-27 
3  jkl78    poor snail     2014-08-27 
4  mno90    fat cat        2008-08-27

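I'm trying to implement some logic where:

If any string in the phrase1 column of dataframe df matches any string in the phrase2 column of dataframe df1, AND the date1 in dataframe df is before the date2 in dataframe df1:

- remove the matched word from df1['phrase2']

If any string in the phrase1 column of dataframe df matches any string in the phrase2 column of dataframe df1, AND the date1 in dataframe df is after the date2 in dataframe df1:

- keep the matched word in df1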

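I'm not sure how to go about this, though. I tried fiddling around and joining the two conditions with the & operator, but it always got more complicated than I could manage. Please help.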

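Expected result:

   id       phrase2        date2
0  abc12    cat nip        2003-03-04
1  def34    baobab tree    2009-05-02
2  ghi56    lazy           2011-08-27
3  jkl78    poor           2014-08-27
4  mno90    fat cat        2008-08-27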

Answer 1

Use:

import pandas as pd

#date1 and date2 must be datetime columns for the comparison below to work
#create dict for mapping if phrase1 holds single words
d = df.set_index('phrase1')['date1'].to_dict()

#if phrase1 can hold several words, split them into one key per word
#d = {c: b for a, b in zip(df['phrase1'], df['date1']) for c in a.split()}
#print (d)

#join together for list of tuples
zipped = zip(df1['phrase2'], df1['date2'])
#max Timestamp constant, used as default for words missing from d
mt = pd.Timestamp.max
#nested list comprehension: drop a word when its date1 is earlier than the row's date2
a = [' '.join([y for y in a.split() if not (d.get(y, mt) < b and y in d)]) for a, b in zipped]
print (a)
['cat nip', 'baobab tree', 'lazy', 'poor', 'fat cat']

df1['phrase2'] = a
print (df1)
      id      phrase2      date2
0  abc12      cat nip 2003-03-04
1  def34  baobab tree 2009-05-02
2  ghi56         lazy 2011-08-27
3  jkl78         poor 2014-08-27
4  mno90      fat cat 2008-08-27
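
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample frames from the question and assumes the date columns are parsed with pd.to_datetime (the question does not say how they were loaded). The helper name drop_outdated_words and the variable cutoffs are made up for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; parsing the dates with
# pd.to_datetime is an assumption about how the frames were loaded.
df = pd.DataFrame({
    "phrase1": ["cat", "tree", "snail", "dog"],
    "date1": pd.to_datetime(
        ["2012-03-04", "2015-05-02", "2002-08-27", "2004-02-27"]
    ),
})
df1 = pd.DataFrame({
    "id": ["abc12", "def34", "ghi56", "jkl78", "mno90"],
    "phrase2": ["cat nip", "baobab tree", "lazy dog", "poor snail", "fat cat"],
    "date2": pd.to_datetime(
        ["2003-03-04", "2009-05-02", "2011-08-27", "2014-08-27", "2008-08-27"]
    ),
})


def drop_outdated_words(phrase, date2, cutoffs):
    """Hypothetical helper: drop every word whose date1 in `cutoffs`
    is earlier than `date2`; words not in `cutoffs` are always kept."""
    kept = [w for w in phrase.split()
            if not (w in cutoffs and cutoffs[w] < date2)]
    return " ".join(kept)


# word -> date1 lookup, built the same way as `d` in the answer
cutoffs = df.set_index("phrase1")["date1"].to_dict()

# assign() returns a new frame, so df1 itself is left unchanged
result = df1.assign(
    phrase2=[drop_outdated_words(p, d2, cutoffs)
             for p, d2 in zip(df1["phrase2"], df1["date2"])]
)
print(result)
```

Checking membership with w in cutoffs before comparing avoids the need for a pd.Timestamp.max default, and returning a new frame keeps the original df1 around for comparison.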

Multiple condition checking and filtering between multiple dataframes
