Question

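I am working with two DataFrames in pandas (df_A has ~6,500 rows and df_B has ~7.5 million), and I have reached a point where I cannot think of a way to avoid iterating over rows.

This is what df_A looks like: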

bidders  other_col1   other_col2
'abcd'      2             3
'efgh'      123           4

This is what df_B looks like:

bidders    time   other_col3
'abcd'     23456       67
'abcd'     23456       43
'jklm'     7896       190
'jklm'     7896       456

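Basically, I need to compare each unique bidder in df_A against the bidders in df_B. For each matching bidder ID I need to find all of its unique timestamps, and for each of those timestamps I need to go through df_B and count how many times that bidder appears with the same timestamp.

This is what my script looks like: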

simul_count = 0

for bidder in df_A['bidders']:

   loc = (df_B['bidders'] == bidder)
   unique_times = pd.unique(df_B.loc[loc, 'time'])

   for time in unique_times:

      loc1 = (df_B['bidders'] == bidder) & (df_B['time'] == time)

      if len(df_B.loc[loc1, 'bidders']) > 1:
          simul_count += 1

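So if we run the code above, simul_count = 1 for the sample data I provided, because bidder 'abcd' placed two bids at the same time. I know this operation will take forever in Python. Using NumPy functions and arrays might give a small boost, I think; but is there a faster way?

Edit: To be clear, the script should output the unique bidder IDs, the number of simultaneous bids, and the timestamps at which the simultaneous bids occurred. The timestamps would make it easier to check suggestions :)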

Answer 1

Does this do what you want?

bidder_counts = pd.merge(df_A,df_B).groupby(['bidders', 'time']).count()
bidder_counts[bidder_counts.other_col1 > 1].other_col1

bidders  time 
'abcd'   23456    2
Name: other_col1, dtype: int64

(Edited to explain my answer in more detail:) pandas merge works like a SQL INNER JOIN, and by default it joins on any common columns; in this case, the column 'bidders':

pd.merge(df_A,df_B)
  bidders  other_col1  other_col2   time  other_col3
0  'abcd'           2           3  23456          67
1  'abcd'           2           3  23456          43

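groupby is the same as GROUP BY in SQL; basically it iterates over the distinct values of the columns you pass it, and you can then apply whatever aggregate function you want to each group. Here we are just doing a count, but you could do a sum or something else if you wanted.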

pd.merge(df_A,df_B).groupby('time').count()

       bidders  other_col1  other_col2  other_col3
time                                              
23456        2           2           2           2

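Finally, I filter on count > 1 and then look at the result (its length gives the number of simultaneous bids). I had to pick a column to filter on; bidders felt the most semantically meaningful to me, as in bidder_counts[bidder_counts.bidders > 1], although that form only works when bidders is still a column (as in the groupby('time') example), because grouping by ['bidders', 'time'] moves it into the index. Since all the counts are identical, bidder_counts[bidder_counts.other_col1 > 1] works just as well.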
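
For anyone who wants to try this on the sample data, here is a minimal self-contained sketch of the same merge-then-group idea. It assumes the bidder IDs are plain strings, passes on="bidders" explicitly rather than relying on the default, and uses size() so there is a single count column; the name n_bids is made up for illustration.

```python
import pandas as pd

# Sample data from the question (bidder IDs as plain strings: an assumption).
df_A = pd.DataFrame({
    "bidders": ["abcd", "efgh"],
    "other_col1": [2, 123],
    "other_col2": [3, 4],
})
df_B = pd.DataFrame({
    "bidders": ["abcd", "abcd", "jklm", "jklm"],
    "time": [23456, 23456, 7896, 7896],
    "other_col3": [67, 43, 190, 456],
})

# Inner join keeps only df_B rows whose bidder also appears in df_A.
merged = pd.merge(df_A, df_B, on="bidders", how="inner")

# One row per (bidder, timestamp) with the number of bids placed at that time.
n_bids = merged.groupby(["bidders", "time"]).size().reset_index(name="n_bids")

# Simultaneous bids are the (bidder, timestamp) pairs that occur more than once.
simultaneous = n_bids[n_bids["n_bids"] > 1]
simul_count = len(simultaneous)

print(simultaneous)
print(simul_count)
```

The simultaneous frame carries the bidder ID, the timestamp, and the count together, which is what the edit to the question asks for.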

Answer 2

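As I understand it, you want to count how many times a bidder appears at the same timestamp, and those bidders need to be present in df_A. Here, a and b are your df_A and df_B, respectively.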

In [35]: a
Out[35]: 
  bidders  other_col1  other_col2
0  'abcd'           2           3
1  'efgh'         123           4

I added a count column to b:

In [36]: b
Out[36]: 
  bidders   time  other_col3  count
0  'abcd'  23456          67      1
1  'abcd'  23456          43      1
2  'jklm'   7896         190      1
3  'jklm'   7896         456      1

appearances = b.groupby(['bidders','time']).sum()['count'].reset_index()

In [38]: appearances
Out[38]: 
  bidders   time  count
0  'abcd'  23456      2
1  'jklm'   7896      2

In [39]: a.merge(appearances, how='right').drop(['other_col1','other_col2'], axis=1)
Out[39]: 
  bidders   time  count
0  'abcd'  23456      2
1  'jklm'   7896      2
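
Note that the output above still lists 'jklm', which is not in a, because a right join keeps every row of appearances. If only bidders present in a are wanted, how='inner' seems to be the closer fit. A small sketch comparing the two joins on the sample data, assuming plain-string bidder IDs:

```python
import pandas as pd

a = pd.DataFrame({
    "bidders": ["abcd", "efgh"],
    "other_col1": [2, 123],
    "other_col2": [3, 4],
})
b = pd.DataFrame({
    "bidders": ["abcd", "abcd", "jklm", "jklm"],
    "time": [23456, 23456, 7896, 7896],
    "other_col3": [67, 43, 190, 456],
})
b["count"] = 1  # helper column so that summing it counts the rows

appearances = b.groupby(["bidders", "time"])["count"].sum().reset_index()

cols = ["bidders", "time", "count"]
# Right join: keeps every (bidder, time) pair from b, even if the bidder is not in a.
right = a.merge(appearances, on="bidders", how="right")[cols]
# Inner join: keeps only pairs whose bidder is present in a.
inner = a.merge(appearances, on="bidders", how="inner")[cols]

print(right)
print(inner)
```

Filtering inner on count > 1 then gives the simultaneous bids for bidders in a.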

Answer 3

Here is one approach.

First, get the unique bidders from df_A:

In [32]: bidders=df_A['bidders'].unique()

In [33]: bidders
Out[33]: array(["'abcd'", "'efgh'"], dtype=object)

Then select the data for those bidders with df_B['bidders'].isin(bidders), group by bidders, and count the unique values of time:

In [34]: df_B[df_B['bidders'].isin(bidders)].groupby('bidders')['time'].nunique()
Out[34]:
bidders
'abcd'     1
Name: time, dtype: int64

If you want the bidder IDs for each timestamp, group by time and use apply() with x['bidders'].unique() to get the list of unique bidder IDs:

In [35]: (df_B[df_B['bidders'].isin(bidders)]
              .groupby('time')
              .apply(lambda x: x['bidders'].unique()))
Out[35]:
time
23456    ['abcd']
dtype: object
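
The nunique() result above counts distinct timestamps per bidder rather than bids that share a timestamp. One possible way to get the latter while keeping the isin filter is DataFrame.duplicated with keep=False, which flags every row whose (bidders, time) pair occurs more than once. A sketch on the sample data, assuming plain-string bidder IDs; the variable names are illustrative:

```python
import pandas as pd

df_A = pd.DataFrame({
    "bidders": ["abcd", "efgh"],
    "other_col1": [2, 123],
    "other_col2": [3, 4],
})
df_B = pd.DataFrame({
    "bidders": ["abcd", "abcd", "jklm", "jklm"],
    "time": [23456, 23456, 7896, 7896],
    "other_col3": [67, 43, 190, 456],
})

bidders = df_A["bidders"].unique()
selected = df_B[df_B["bidders"].isin(bidders)]

# keep=False marks all rows of a repeated (bidders, time) pair, not just the later repeats.
is_simultaneous = selected.duplicated(subset=["bidders", "time"], keep=False)
simultaneous_rows = selected[is_simultaneous]

# Collapse to one row per (bidder, timestamp) to count simultaneous-bid events.
events = simultaneous_rows[["bidders", "time"]].drop_duplicates()

print(simultaneous_rows)
print(len(events))
```

This keeps the original df_B rows (including other_col3), which may be handy when checking results by eye.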

Answer 4

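Step 1: (DF_B_1) From df_B, get every distinct (bidders, time) pair together with its count.

Step 2: (DF_B_2) Keep only the records where count > 1 (those are the potential candidates).

Step 3: Merge DF_B_2 and df_A on the bidders column with how = inner. This filters out the bidders that are not in df_A.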

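In pandas, roughly:

# Step 1: count the rows for each (bidders, time) pair
DF_B_1 = df_B.groupby(["bidders", "time"]).size().reset_index(name="count")
# Step 2: keep only the pairs that occur more than once
DF_B_2 = DF_B_1[DF_B_1["count"] > 1]
# Step 3: the inner merge drops bidders that are not in df_A
res = pd.merge(left=DF_B_2, right=df_A, how="inner", on="bidders")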
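
Since the question mentions wanting to check suggestions, the three steps can be wrapped in a function and compared against a brute-force loop on a small sample. This is only a sketch: simultaneous_bids and simul_count_loop are hypothetical names, the bidder IDs are assumed to be plain strings, and the loop is meant for tiny inputs only.

```python
import pandas as pd


def simultaneous_bids(df_A, df_B):
    """Hypothetical helper: (bidders, time, count) rows with count > 1 for bidders in df_A."""
    counts = df_B.groupby(["bidders", "time"]).size().reset_index(name="count")
    candidates = counts[counts["count"] > 1]
    # Deduplicate df_A's bidders so the merge cannot multiply rows.
    known = df_A[["bidders"]].drop_duplicates()
    return candidates.merge(known, on="bidders", how="inner")


def simul_count_loop(df_A, df_B):
    """Brute-force reference that follows the loop in the question."""
    total = 0
    for bidder in df_A["bidders"].unique():
        times = df_B.loc[df_B["bidders"] == bidder, "time"]
        for t in times.unique():
            if (times == t).sum() > 1:
                total += 1
    return total


df_A = pd.DataFrame({
    "bidders": ["abcd", "efgh"],
    "other_col1": [2, 123],
    "other_col2": [3, 4],
})
df_B = pd.DataFrame({
    "bidders": ["abcd", "abcd", "jklm", "jklm"],
    "time": [23456, 23456, 7896, 7896],
    "other_col3": [67, 43, 190, 456],
})

result = simultaneous_bids(df_A, df_B)
print(result)
print(len(result) == simul_count_loop(df_A, df_B))
```

Aggregating df_B before merging means the join only has to handle the candidate pairs rather than all of df_B's rows.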
