Merging data frames by date when the dates do not match

Time: 2018-04-20 15:15:55

Tags: python pandas date merge

My process is as follows:

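  1. Import a csv containing dates, activations, and cancellations
  2. Subset the data by activation or cancellation
  3. Pivot the data using aggfunc 'sum'
  4. Convert back to data frames
  5. Now I need to merge the 2 data frames together, but some dates exist in one data frame and not in the other. Both data frames start on January 1, 2017 and end on December 31, 2017. Preferably, any index_month that has to be filled in would get a corresponding value of 0.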

    Here is the .head() of both data frames:

    [images: .head() output of the two data frames]

    For reference, here is the code:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import os
    import datetime
    
    %matplotlib inline
    
    #import data
    directory1 = r"C:\python\Contracts"  # raw string so the backslashes are not treated as escapes
    directory_source = os.path.join(directory1, "Contract_Data.csv")
    df_source = pd.read_csv(directory_source)
    
    #format date ranges as times
    #df_source["Activation_Month"] = pd.to_datetime(df_source["Activation_Month"])
    #df_source["Cancellation_Month"] = pd.to_datetime(df_source["Cancellation_Month"])
    df_source["Activation_Day"] = pd.to_datetime(df_source["Activation_Day"])
    df_source["Cancellation_Day"] = pd.to_datetime(df_source["Cancellation_Day"])
    
    
    #subset the data based on status
    df_active = df_source[df_source["Order Status"]=="Active"]
    df_active = pd.DataFrame(df_active[["Activation_Day", "Event_Value"]].copy())
    df_cancelled = df_source[df_source["Order Status"]=="Cancelled"]
    df_cancelled = pd.DataFrame(df_cancelled[["Cancellation_Day", "Event_Value"]].copy())
    
    #remove activations outside 2017 and cancellations outside 2017
    df_cancelled = df_cancelled[(df_cancelled['Cancellation_Day'] > '2016-12-31') & 
                                (df_cancelled['Cancellation_Day'] <= '2017-12-31')]
    
    df_active = df_active[(df_active['Activation_Day'] > '2016-12-31') & 
                                (df_active['Activation_Day'] <= '2017-12-31')]
    
    
    
    #pivot the data to aggregate by day
    df_active_aggregated = df_active.pivot_table(index='Activation_Day',
                                                 values='Event_Value',
                                                 aggfunc='sum')
    
    df_cancelled_aggregated = df_cancelled.pivot_table(index='Cancellation_Day',
                                                       values='Event_Value',
                                                       aggfunc='sum')
    
    
    #convert pivot tables back to useable dataframes
    activations_aggregated = pd.DataFrame(df_active_aggregated.to_records())
    cancellations_aggregated = pd.DataFrame(df_cancelled_aggregated.to_records())
    
    #rename the time columns so they can be referenced when merging into one DF
    activations_aggregated.columns = ["index_month", "Activations"]
    #activations_aggregated = activations_aggregated.set_index(pd.DatetimeIndex(activations_aggregated["index_month"]))
    
    cancellations_aggregated.columns = ["index_month", "Cancellations"]
    #cancellations_aggregated = cancellations_aggregated.set_index(pd.DatetimeIndex(cancellations_aggregated["index_month"]))
    

    I know there are many posts that address problems similar to this one, but I have not found any that helped. Thanks to anyone who can help me!

1 answer:

Answer 0: (score: 4)

You can try:

activations_aggregated.merge(cancellations_aggregated, how='outer', on='index_month').fillna(0)
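An outer merge keeps every date that appears in either frame, and `fillna(0)` replaces the gaps it creates with 0. Dates that are missing from both frames still will not appear, so if every day of 2017 is needed, one option is to reindex the merged result onto a full date range. Below is a minimal sketch of both steps; the toy values are made up for illustration, and only the frame and column names come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's two aggregated frames.
activations_aggregated = pd.DataFrame({
    "index_month": pd.to_datetime(["2017-01-01", "2017-01-03"]),
    "Activations": [10, 20],
})
cancellations_aggregated = pd.DataFrame({
    "index_month": pd.to_datetime(["2017-01-02", "2017-01-03"]),
    "Cancellations": [5, 7],
})

# Outer merge: keep every date found in either frame, then fill the gaps with 0.
merged = (
    activations_aggregated
    .merge(cancellations_aggregated, how="outer", on="index_month")
    .fillna(0)
    .sort_values("index_month")
    .set_index("index_month")
)

# Optional extra step (an assumption about the desired output):
# add rows for the days of 2017 that are missing from both frames.
full_range = pd.date_range("2017-01-01", "2017-12-31", freq="D")
merged_full = merged.reindex(full_range, fill_value=0)
merged_full.index.name = "index_month"

print(merged)
print(merged_full.head())
```

Note that `fillna(0)` leaves the value columns as floats, because the merge introduces NaN first; append `.astype(int)` if integer counts are preferred.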