Question

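I have a dataframe called df with x rows and y columns. I have another dataframe, df2, with fewer than x rows and y-1 columns. I want to filter df for the rows that are identical to rows of df2 in columns 1 to y-1. Is there a way to do this in a vectorized fashion, without iterating over the rows of df2?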

Here is the code for a sample df:

import pandas
import numpy.random as rd
dates = pandas.date_range('1/1/2000', periods=8)
df = pandas.DataFrame(rd.randn(8, 5), index=dates, columns=['call/put', 'expiration', 'strike', 'ask', 'bid'])
df.iloc[2,4]=0
df.iloc[2,3]=0
df.iloc[3,4]=0
df.iloc[3,3]=0
df.iloc[2,2]=0.5
# DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
df=pandas.concat([df, df.iloc[2:3]])
df.iloc[8:9,3:5]=1
df.iloc[8:9,2:3]=0.6
df=pandas.concat([df, df.iloc[8:9]])
df.iloc[9,2]=0.4

df2 is computed as follows:

df4=df[(df["ask"]==0) & (df["bid"]==0)]

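Now I want to filter df for the rows that look like the rows in df2, except for the strike column, whose value should be 0.4. The filtering should be done without iteration, because my real-world df is very large.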

Answer 1

You are trying to do a merge on the two dataframes, which should return the (set) intersection of the two.

pandas.merge(df, df2, on=['call/put','expiration','strike','ask'], left_index=True, right_index=True)


            call/put  expiration    strike  ask  bid_x  bid_y
2000-01-03  0.614738   -0.363933  0.500000    0      0      0
2000-01-03  0.614738   -0.363933  0.600000    1      1      0
2000-01-03  0.614738   -0.363933  0.400000    1      1      0
2000-01-04  1.077427   -1.046127  0.025931    0      0      0

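I renamed df4 to df2. The dataframe returned above should be the full list of records from df that match the records in the "whitelist" contained in df2, based on the columns listed in the statement above.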
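As a rough illustration of the whitelist idea, here is a minimal sketch on made-up data. The frame contents and the name `whitelist` are hypothetical and are not the question's random frame. It assumes the whitelist is reduced to its key columns and de-duplicated first, so the merge neither adds suffixed columns nor multiplies rows.

```python
import pandas as pd

# Hypothetical toy data, chosen only to make the matching easy to follow.
df = pd.DataFrame({
    "call/put":   [1, 1, 2, 3],
    "expiration": [10, 10, 20, 30],
    "strike":     [0.5, 0.4, 0.7, 0.9],
})
whitelist = pd.DataFrame({
    "call/put":   [1, 3, 3],
    "expiration": [10, 30, 30],
})

keys = ["call/put", "expiration"]

# Keep one copy of each key combination so matches are not duplicated.
unique_keys = whitelist[keys].drop_duplicates()

# An inner merge keeps only the rows of df whose key columns
# also appear in the whitelist.
matched = pd.merge(df, unique_keys, on=keys, how="inner")
print(matched)
```

Note that the merged result gets a fresh integer index, so any information carried by the original index has to be moved into a column beforehand if it is needed.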

A slightly different merge of df and df2, dropping 'strike' from the columns to match on and adding 'bid', returns:

pandas.merge(df, df2, on=['call/put','expiration','ask','bid'], left_index=True, right_index=True, how='inner')
            call/put  expiration  strike_x  ask  bid  strike_y
2000-01-03  0.614738   -0.363933  0.500000    0    0  0.500000
2000-01-03  0.614738   -0.363933  0.600000    1    1  0.500000
2000-01-03  0.614738   -0.363933  0.400000    1    1  0.500000
2000-01-04  1.077427   -1.046127  0.025931    0    0  0.025931

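That is still not quite right. I think this is because of the left_index=True / right_index=True arguments. To force the match, you can convert the date index of each dataframe into a regular column and include it in the columns to match on.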

e.g.:

df['date'] = df.index
df2['date'] = df2.index

Then:

pandas.merge (df,df2,on=['call/put','expiration','ask','bid','date'],how='inner')

returns:

    call/put  expiration  strike_x  ask  bid                date  strike_y
 0  0.367269   -0.616125   0.50000    0    0 2000-01-03 00:00:00   0.50000
 1 -0.508974    0.281017   0.65791    0    0 2000-01-04 00:00:00   0.65791

I think that is closer to what you want.
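If the original index and row order of df need to be preserved, a boolean mask may be more convenient than a merge. The sketch below uses a different technique from the merge above: it builds a MultiIndex from the key columns of each frame and tests membership with isin. The choice of key columns (`keys`) and the smaller 4-row frame are assumptions made for illustration; the extra strike == 0.4 condition follows the requirement stated in the question.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the question's frame (assumed shape, seeded for repeatability).
rng = np.random.default_rng(0)
dates = pd.date_range("1/1/2000", periods=4)
df = pd.DataFrame(rng.standard_normal((4, 5)), index=dates,
                  columns=["call/put", "expiration", "strike", "ask", "bid"])
df.iloc[2, 3:5] = 0                      # one row with ask == bid == 0

# Duplicate that row, then change strike/ask/bid on the copy.
df = pd.concat([df, df.iloc[2:3]])
df.iloc[4, 2] = 0.4
df.iloc[4, 3:5] = 1

df2 = df[(df["ask"] == 0) & (df["bid"] == 0)]

keys = ["call/put", "expiration"]        # assumed choice of columns to match on

# One composite key per row; isin checks all rows at once, with no Python loop.
mask = pd.MultiIndex.from_frame(df[keys]).isin(pd.MultiIndex.from_frame(df2[keys]))

# Combine with the strike condition as plain NumPy boolean arrays.
strike_ok = (df["strike"] == 0.4).to_numpy()
result = df[mask & strike_ok]
print(result)
```

Because the mask is positional, this also works when df has duplicate index labels, as it does here after the concat.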
