
Question

I have two DataFrames, DF1 and DF2.

DF1:

StartDate

1/1/2013
2/1/2013
11/1/2014
4/1/2014
5/1/2015

DF2:

EmploymentType        EmpStatus           EmpStartDate

Employee              Active              11/5/2012
Employee              Active              9/10/2012
Employee              Active              10/15/2013
Employee              Active              10/29/2013
Employee              Terminated          10/29/2013
Contractor            Terminated          11/20/2014
Contractor            Active              11/20/2014

For each StartDate in DF1, I want the number of rows in DF2 where EmploymentType = 'Employee', EmpStatus = 'Active', and EmpStartDate <= that StartDate.

Expected output:

Start Date    Count

1/1/2013      2
2/1/2013      2
11/1/2014     4
4/1/2014      4
5/1/2015      4

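How can I achieve this without merging the two DataFrames?

There is no common key, and I need a row count based on a condition, so I cannot merge the DataFrames directly. I also cannot merge on a temporary column, because I need to avoid a cross join.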

Answer 1

If the DataFrames are not too large, you can do this with a Cartesian join followed by a filter:

(df1.assign(key=1)
   .merge(df2.query('EmploymentType == "Employee" and EmpStatus=="Active"').assign(key=1), 
          on='key')
   .query('EmpStartDate <= StartDate')
   .groupby('StartDate')['key'].count())

Output:

StartDate
2013-01-01    2
2013-02-01    2
2014-04-01    4
2014-11-01    4
2015-05-01    4
Name: key, dtype: int64

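Details:

Use query to filter df2 to the rows where EmploymentType and EmpStatus equal Employee and Active, respectively.
Assign a dummy key to each DataFrame and merge on that key to create a Cartesian join of all records.
Use query again to keep only the joined rows where EmpStartDate is less than or equal to StartDate.
Finally, groupby StartDate and count.

Also note that query is a shortcut. If your column names contain special characters or spaces, you will need boolean indexing to filter the DataFrames instead.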
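The boolean-indexing form mentioned above can be sketched as follows. This is a minimal illustration, assuming the dates in the question are month/day/year and have been parsed to datetime; the names `active`, `joined`, and `counts` are introduced here for readability and are not part of the original answer.

```python
import pandas as pd

# Sample data from the question, parsed as month/day/year (an assumption).
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active"] * 4 + ["Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# Boolean-indexing equivalent of the first query call.
active = df2[(df2["EmploymentType"] == "Employee")
             & (df2["EmpStatus"] == "Active")]

# Cartesian join on a dummy key, as in the answer.
joined = df1.assign(key=1).merge(active.assign(key=1), on="key")

# Boolean-indexing equivalent of the second query call, then count per date.
counts = (joined[joined["EmpStartDate"] <= joined["StartDate"]]
          .groupby("StartDate")["key"].count())
print(counts)
```

One thing to keep in mind with this approach: a StartDate with no matching rows is filtered out before the groupby, so it would be missing from the result rather than reported as 0.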

Option 2:

pd.merge_asof(df2.query('EmploymentType == "Employee" and EmpStatus == "Active"').sort_values('EmpStartDate'), 
              df1.sort_values('StartDate'), 
              left_on='EmpStartDate', 
              right_on='StartDate', 
              direction='forward')\
  .groupby('StartDate')['EmploymentType'].count()\
  .reindex(df1.StartDate.sort_values())\
  .cumsum()\
  .ffill()

Output:

StartDate
2013-01-01    2.0
2013-02-01    2.0
2014-04-01    4.0
2014-11-01    4.0
2015-05-01    4.0
Name: EmploymentType, dtype: float64

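Details:

Use pd.merge_asof to join the filtered df2 to the nearest df1 date looking forward, that is, the first StartDate on or after each EmpStartDate.
groupby the StartDate brought in from df1 and count.
reindex the result on the sorted start dates from df1, so that dates with no match appear as NaN.
Use cumsum to mimic the <= condition by turning the per-date counts into running totals.
Use ffill to fill the missing records with the previous total.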
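To make the steps above easier to follow, the chain can be unrolled so each intermediate result can be inspected. This is a sketch under the assumption that the question's dates are month/day/year; the intermediate names (`matched`, `per_date`, `aligned`, `running`, `result`) are hypothetical and only added for illustration.

```python
import pandas as pd

# Sample data from the question, parsed as month/day/year (an assumption).
df1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
df2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active"] * 4 + ["Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# merge_asof needs both inputs sorted on their join keys.
active = (df2[(df2["EmploymentType"] == "Employee")
              & (df2["EmpStatus"] == "Active")]
          .sort_values("EmpStartDate"))
starts = df1.sort_values("StartDate")

# Step 1: give each employee the first StartDate on or after EmpStartDate.
matched = pd.merge_asof(active, starts, left_on="EmpStartDate",
                        right_on="StartDate", direction="forward")

# Step 2: number of employees that landed on each StartDate.
per_date = matched.groupby("StartDate")["EmploymentType"].count()

# Step 3: one row per StartDate in df1; dates nobody landed on become NaN.
aligned = per_date.reindex(starts["StartDate"])

# Step 4: running total; cumsum skips NaN entries and leaves them as NaN.
running = aligned.cumsum()

# Step 5: carry the previous running total into the NaN rows.
result = running.ffill()
print(result)
```

With the sample data, two employees land on 2013-01-01 and two on 2014-04-01, so the remaining dates only receive a value in the final forward-fill step. If the earliest StartDate had no match, it would stay NaN, since there is no earlier total to carry forward.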

Answer 2

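# Assumes both date columns have already been parsed to datetime.
def compensation(x):
    # Count active employees in DF2 who started on or before date x.
    mask = ((DF2['EmploymentType'] == 'Employee')
            & (DF2['EmpStatus'] == 'Active')
            & (DF2['EmpStartDate'] <= x))
    return mask.sum()

# Series.apply takes no axis argument; the function is called once per date.
DF1['Count'] = DF1['StartDate'].apply(compensation)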

This approach uses boolean indexing and counts the matching rows. https://pandas.pydata.org/pandas-docs/stable/indexing.html
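The row-wise count scans DF2 once per StartDate. As a possible variation, not part of the original answer, the same counts can be obtained by sorting the qualifying start dates once and using a binary search via `np.searchsorted`. The sketch below assumes month/day/year dates and uses hypothetical names (`count_started_by`, `emp_starts`, `vectorized`); it computes both versions so they can be compared.

```python
import numpy as np
import pandas as pd

# Sample data from the question, parsed as month/day/year (an assumption).
DF1 = pd.DataFrame({"StartDate": pd.to_datetime(
    ["1/1/2013", "2/1/2013", "11/1/2014", "4/1/2014", "5/1/2015"],
    format="%m/%d/%Y")})
DF2 = pd.DataFrame({
    "EmploymentType": ["Employee"] * 5 + ["Contractor"] * 2,
    "EmpStatus": ["Active"] * 4 + ["Terminated", "Terminated", "Active"],
    "EmpStartDate": pd.to_datetime(
        ["11/5/2012", "9/10/2012", "10/15/2013", "10/29/2013",
         "10/29/2013", "11/20/2014", "11/20/2014"],
        format="%m/%d/%Y"),
})

# Rows of DF2 that are active employees.
mask = (DF2["EmploymentType"] == "Employee") & (DF2["EmpStatus"] == "Active")

# Row-wise version: one boolean-index pass over DF2 per StartDate.
def count_started_by(date):
    return int((mask & (DF2["EmpStartDate"] <= date)).sum())

row_wise = DF1["StartDate"].apply(count_started_by)

# Variation: sort the qualifying start dates once, then binary-search.
# side="right" returns how many sorted values are <= each StartDate.
emp_starts = np.sort(DF2.loc[mask, "EmpStartDate"].to_numpy())
vectorized = np.searchsorted(emp_starts, DF1["StartDate"].to_numpy(),
                             side="right")

result = DF1.assign(Count=vectorized)
print(result)
```

Both versions keep DF1's original row order and neither performs a merge. The binary-search form is likely to scale better when DF1 is large, though that has not been measured here.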

Compare columns of two DataFrames without merging them
