Merge with pandas and keep the next n rows and the previous k rows

Time: 2019-02-06 13:27:18

Tags: python pandas

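I am merging two DataFrames and want to access the first n rows of a column after each match.

Where events_df['event'] matches prices_df['date'] and events_df['ticker'] matches prices_df['tic'], I want to keep the first n values of prices_df['price'], starting at the matching row.

events_df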

        event ticker
0  01-01-2019   MSFT
1  12-12-2018   MSFT
2  12-11-2018   MSFT
3  02-03-2019   AAPL
4  12-12-2018   AAPL
5  12-11-2018   AAPL
6  01-01-2019   AAPL


prices_df

         date   tic  price
0  01-01-2019  MSFT    1.0
1  02-01-2019  MSFT    1.1
2  03-01-2019  MSFT    1.2
3  04-01-2019  MSFT    1.3
4  05-01-2019  MSFT    1.4
5  01-01-2019  AAPL    2.0
6  02-01-2019  AAPL    2.1
7  03-01-2019  AAPL    2.2
8  04-01-2019  AAPL    2.3
9  05-01-2019  AAPL    2.4

I have tried merging:

merged = events_df.merge(prices_df,left_on=['ticker','event'],right_on=['tic','date'])

Expected output for n = 4 (from the matches at events_df['event'] index 0 and 6):

         date ticker  price
0  01-01-2019   MSFT    1.0
1  02-01-2019   MSFT    1.1
2  03-01-2019   MSFT    1.2
3  04-01-2019   MSFT    1.3
4  01-01-2019   AAPL    2.0
5  02-01-2019   AAPL    2.1
6  03-01-2019   AAPL    2.2
7  04-01-2019   AAPL    2.3
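
For reproducibility, the sample frames above can be built as in the sketch below. It also shows why the merge alone is not enough: an inner merge returns only the two matched rows, not the rows that follow them. The dates are kept as plain strings here, which is an assumption, since the question does not say whether they are parsed as datetimes.

```python
import pandas as pd

# Sample data copied from the question; dates are left as strings (assumption)
events_df = pd.DataFrame({
    "event": ["01-01-2019", "12-12-2018", "12-11-2018", "02-03-2019",
              "12-12-2018", "12-11-2018", "01-01-2019"],
    "ticker": ["MSFT", "MSFT", "MSFT", "AAPL", "AAPL", "AAPL", "AAPL"],
})

prices_df = pd.DataFrame({
    "date": ["01-01-2019", "02-01-2019", "03-01-2019",
             "04-01-2019", "05-01-2019"] * 2,
    "tic": ["MSFT"] * 5 + ["AAPL"] * 5,
    "price": [1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4],
})

# The inner merge keeps only rows where both key pairs match
merged = events_df.merge(prices_df,
                         left_on=["ticker", "event"],
                         right_on=["tic", "date"])
print(merged)
```

Only the rows for 01-01-2019 (one per ticker) survive the merge, so the following rows have to be recovered from prices_df separately.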

2 answers:

Answer 0 (score: 0)

Use:

# sample data changed to be more general (an extra, non-matching MSFT row comes first)
print (prices_df)
          date   tic  price
0   01-01-2018  MSFT    1.0
1   01-01-2019  MSFT    1.0
2   02-01-2019  MSFT    1.1
3   03-01-2019  MSFT    1.2
4   04-01-2019  MSFT    1.3
5   05-01-2019  MSFT    1.4
6   01-01-2019  AAPL    2.0
7   02-01-2019  AAPL    2.1
8   03-01-2019  AAPL    2.2
9   04-01-2019  AAPL    2.3
10  05-01-2019  AAPL    2.4

import numpy as np

# n rows down (after the match), k rows up (before the match)
n = 2
k = 1
# move the index into a column with reset_index so it is not lost in the merge
idx = events_df.merge(prices_df.rename_axis('idx').reset_index(),
                         left_on=['ticker','event'],
                         right_on=['tic','date'])['idx']

print (idx)
0    1
1    6
Name: idx, dtype: int64

# create groups by matching against the original index; [::-1] reverses the order
s1 = prices_df.index.isin(idx).cumsum()
s2 = prices_df.index.isin(idx)[::-1].cumsum()

# replace the rows before the first match (up) and after the last match (lo) with NaN
up = np.where(s1 != 0, s1, np.nan)
lo = np.where(s2[::-1] != 0, s2[::-1] , np.nan)

# get counters, compare with le (<=), and exclude the NaN groups (first, last)
prices_df['um'] = prices_df.groupby(up).cumcount().le(n) & ~np.isnan(up)
prices_df['lm'] = prices_df.groupby(lo).cumcount(ascending=False).le(k) & ~np.isnan(lo)
print (prices_df)
          date   tic  price     um     lm
0   01-01-2018  MSFT    1.0  False   True
1   01-01-2019  MSFT    1.0   True   True
2   02-01-2019  MSFT    1.1   True  False
3   03-01-2019  MSFT    1.2   True  False
4   04-01-2019  MSFT    1.3  False  False
5   05-01-2019  MSFT    1.4  False   True
6   01-01-2019  AAPL    2.0   True   True
7   02-01-2019  AAPL    2.1   True  False
8   03-01-2019  AAPL    2.2   True  False
9   04-01-2019  AAPL    2.3  False  False
10  05-01-2019  AAPL    2.4  False  False

#filter by boolean indexing
mask = prices_df['um'] | prices_df['lm'] 
prices_df = prices_df[mask]
print (prices_df)
         date   tic  price     um     lm
0  01-01-2018  MSFT    1.0  False   True
1  01-01-2019  MSFT    1.0   True   True
2  02-01-2019  MSFT    1.1   True  False
3  03-01-2019  MSFT    1.2   True  False
5  05-01-2019  MSFT    1.4  False   True
6  01-01-2019  AAPL    2.0   True   True
7  02-01-2019  AAPL    2.1   True  False
8  03-01-2019  AAPL    2.2   True  False
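
Note that in the output above row 5 (MSFT, 05-01-2019) is kept as the "previous" row of the AAPL match at row 6, because the groups are built over the whole frame rather than per ticker. If the window should stay inside one ticker, a per-ticker variant might look like the sketch below. The function name keep_window and its loop are illustrative and not part of the original answer; it assumes prices are already sorted by date within each ticker.

```python
import numpy as np
import pandas as pd

def keep_window(prices, events, n, k):
    """Keep each matched row plus n rows after and k rows before it, per ticker."""
    prices = prices.reset_index(drop=True)
    # (ticker, date) pairs that count as a match
    keys = set(zip(events["ticker"], events["event"]))
    matched = np.array([pair in keys
                        for pair in zip(prices["tic"], prices["date"])])
    # Position of every row inside its own ticker
    pos = prices.groupby("tic").cumcount()
    keep = np.zeros(len(prices), dtype=bool)
    for i in np.flatnonzero(matched):
        same_ticker = prices["tic"] == prices["tic"].iloc[i]
        offset = pos - pos.iloc[i]
        # between() is inclusive on both ends: -k .. n around the match
        keep |= (same_ticker & offset.between(-k, n)).to_numpy()
    return prices[keep]

events_df = pd.DataFrame({
    "event": ["01-01-2019", "12-12-2018", "12-11-2018", "02-03-2019",
              "12-12-2018", "12-11-2018", "01-01-2019"],
    "ticker": ["MSFT", "MSFT", "MSFT", "AAPL", "AAPL", "AAPL", "AAPL"],
})
# Same "more general" sample as above, with the extra 01-01-2018 MSFT row
prices_df = pd.DataFrame({
    "date": ["01-01-2018", "01-01-2019", "02-01-2019", "03-01-2019",
             "04-01-2019", "05-01-2019", "01-01-2019", "02-01-2019",
             "03-01-2019", "04-01-2019", "05-01-2019"],
    "tic": ["MSFT"] * 6 + ["AAPL"] * 5,
    "price": [1.0, 1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4],
})

result = keep_window(prices_df, events_df, n=2, k=1)
print(result)
```

With n = 2 and k = 1 this keeps rows 0 to 3 for MSFT and rows 6 to 8 for AAPL; row 5 is dropped because it belongs to a different ticker than the match that follows it. Whether that is the desired behaviour depends on what "previous k rows" should mean at a ticker boundary.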

Answer 1 (score: 0)

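Your merge looks fine. You just need to extract the columns you want from it, because right after the merge the result contains all columns from both DataFrames. So: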

merged = events_df.merge(prices_df, left_on=['ticker', 'event'], right_on=['tic', 'date'])
merged = merged[['date', 'ticker', 'price']]

Then you have to filter it so that the price is below 3 (or n, if you prefer):

n = 3
merged = merged[merged['price'] < n]
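
The filter above selects rows by price value. If the goal is instead the first n rows by position from each match, as in the question's expected output, a positional selection may be closer. This is a minimal sketch, assuming prices_df has a default RangeIndex (so labels equal positions) and is sorted by date within each ticker; the loop and the name result are illustrative only.

```python
import pandas as pd

events_df = pd.DataFrame({
    "event": ["01-01-2019", "12-12-2018", "12-11-2018", "02-03-2019",
              "12-12-2018", "12-11-2018", "01-01-2019"],
    "ticker": ["MSFT", "MSFT", "MSFT", "AAPL", "AAPL", "AAPL", "AAPL"],
})
prices_df = pd.DataFrame({
    "date": ["01-01-2019", "02-01-2019", "03-01-2019",
             "04-01-2019", "05-01-2019"] * 2,
    "tic": ["MSFT"] * 5 + ["AAPL"] * 5,
    "price": [1.0, 1.1, 1.2, 1.3, 1.4, 2.0, 2.1, 2.2, 2.3, 2.4],
})

n = 4
# reset_index() exposes the row position of each price row as column "index"
merged = events_df.merge(prices_df.reset_index(),
                         left_on=["ticker", "event"],
                         right_on=["tic", "date"])

windows = []
for start, tic in zip(merged["index"], merged["tic"]):
    # n rows starting at the matched row, restricted to the same ticker
    window = prices_df.iloc[start:start + n]
    windows.append(window[window["tic"] == tic])

result = (pd.concat(windows, ignore_index=True)
            .rename(columns={"tic": "ticker"})[["date", "ticker", "price"]])
print(result)
```

For the question's sample data and n = 4 this yields the eight rows listed in the expected output.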