Question

I am fairly new to Python pandas, so I would be glad to get some help from this group.

Say that, for each customer ID, I have two date fields coming from different sources, like this:

id  date_source1 date_source2
1    1/11/2017    15/11/2017
2    3/3/2018
3                  4/4/2018
4    1/10/2017     1/9/2017

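A given customer may have both fields populated, or only one of them.

I just want to create a new field, date_final, which, when both are populated, is:

date_source2, if it is earlier than date_source1
date_source2, if it is later than date_source1 but in the same year and month as date_source1
otherwise, date_source1

In the example above, date_final would be: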

15/11/2017 for ID 1,
3/3/2018 for ID 2,
4/4/2018 for ID 3,
1/9/2017 for ID 4

Any help would be appreciated. Thanks!

Answer 1

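I suggest using pandas df.apply to compute the new column from the values of the other columns. You define a function that takes a row as input and computes whatever you need. You can refer to the row elements either by name or by position, as shown below.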

In [44]: import pandas as pd 
    ...:                                                                                 

In [45]: df = pd.DataFrame({'a':[1,2,3],'b':[0,10,None]})                                


In [46]: df                                                                              
Out[46]: 
   a     b
0  1   0.0
1  2  10.0
2  3   NaN

In [50]: def comp(row):
    ...:     # row.iloc[i] selects by position; row['b'] is the by-name equivalent
    ...:     if pd.isna(row.iloc[1]):
    ...:         return 'invalid'
    ...:     if row.iloc[0] > row.iloc[1]:
    ...:         return 'col_a'
    ...:     else:
    ...:         return 'col_b'
    ...:

In [51]: df['compared'] = df.apply(comp, axis=1)                                         

In [52]: df                                                                              
Out[52]: 
   a     b compared
0  1   0.0    col_a
1  2  10.0    col_b
2  3   NaN  invalid

If you go this route, your comparison can be as complex as you need. You should also handle the NaN values in your dataframe.
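For date columns, the missing values show up as NaT rather than NaN once the strings are parsed. The following is a minimal sketch, assuming the question's sample data is loaded as day/month/year strings with None for the empty cells (the question does not say how the data is stored):

```python
import pandas as pd

# Sample data from the question; None marks an empty cell (an assumption).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date_source1": ["1/11/2017", "3/3/2018", None, "1/10/2017"],
    "date_source2": ["15/11/2017", None, "4/4/2018", "1/9/2017"],
})

# Parse day/month/year strings; missing entries become NaT.
for col in ["date_source1", "date_source2"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# pd.isna / .isna() treat NaT like NaN, so the same guard works for dates.
missing_counts = df[["date_source1", "date_source2"]].isna().sum()
print(missing_counts)
```

With the columns parsed this way, a row function can check pd.isna on each date before comparing them.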

Answer 2

Just follow the algorithm you described in words. Define a comparison function:

def cmpDates(row):
    d1 = row.date_source1
    d2 = row.date_source2
    if pd.isna(d1):
        return d2    # No d1
    elif pd.isna(d2):
        return d1    # No d2
    elif d2 < d1:
        return d2    # d2 earlier
    elif d1.year == d2.year and d1.month == d2.month:
        return d2    # Same month
    else:
        return d1    # d1 earlier

And apply it:

df['dat'] = df.apply(cmpDates, axis=1)

Perhaps the detail you did not know was how to handle the "same month" case. Now you know.
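To try this end to end, here is a minimal sketch that rebuilds the question's sample table and applies the same rules row by row. The function name cmp_dates, the None placeholders for empty cells, and the output column name date_final are assumptions made for the illustration; the comparisons only work once both columns have been converted with pd.to_datetime.

```python
import pandas as pd

# Sample data from the question; None marks an empty cell (an assumption).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date_source1": ["1/11/2017", "3/3/2018", None, "1/10/2017"],
    "date_source2": ["15/11/2017", None, "4/4/2018", "1/9/2017"],
})
for col in ["date_source1", "date_source2"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")


def cmp_dates(row):
    """Pick the final date for one row, following the question's rules."""
    d1, d2 = row["date_source1"], row["date_source2"]
    if pd.isna(d1):
        return d2  # only source 2 is populated
    if pd.isna(d2):
        return d1  # only source 1 is populated
    if d2 < d1:
        return d2  # source 2 is earlier
    if (d1.year, d1.month) == (d2.year, d2.month):
        return d2  # source 2 is later, but in the same year and month
    return d1


df["date_final"] = df.apply(cmp_dates, axis=1)
print(df)
```

The resulting date_final column matches the values listed in the question for IDs 1 to 4.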

Another solution: define the comparison function as:

def cmpDates(row):
    d1 = row.date_source1
    d2 = row.date_source2
    if pd.isna(d1):
        return d2
    elif pd.isna(d2):
        return d1
    # Take d2 if it is earlier, or if both dates share the same year and month
    same_month = d1.to_period('M') == d2.to_period('M')
    return d2 if d2 < d1 or same_month else d1

The code is shorter, but its readability is debatable.
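One way to express the "same year and month" test on its own is to compare monthly periods. The sketch below is only an illustration; same_month is a hypothetical helper name, not something from pandas or from the question.

```python
import pandas as pd


def same_month(a, b):
    """Hypothetical helper: True if both timestamps fall in the same calendar month."""
    # to_period("M") reduces a timestamp to its year and month.
    return a.to_period("M") == b.to_period("M")


d1 = pd.Timestamp("2017-11-01")
d2 = pd.Timestamp("2017-11-15")
d3 = pd.Timestamp("2017-12-01")
d4 = pd.Timestamp("2018-11-15")

print(same_month(d1, d2))  # same year, same month
print(same_month(d2, d3))  # same year, different month
print(same_month(d2, d4))  # same month number, different year
```

Comparing periods keeps the year in the test, so November 2017 and November 2018 are not treated as the same month.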

Answer 3

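Use np.where():

import numpy as np
import pandas as pd

# Parse the day/month/year strings; missing values become NaT
df['date_source1'] = pd.to_datetime(df['date_source1'], format='%d/%m/%Y')
df['date_source2'] = pd.to_datetime(df['date_source2'], format='%d/%m/%Y')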

# date_source1 is not populated
c1 = df.date_source1.isna()

# date_source2 is populated
c2 = ~df.date_source2.isna()

# date_source2 is earlier than date_source1, or they have the same Year/Month
c3 = df.date_source2.lt(df.date_source1) | df.date_source2.dt.strftime('%Y-%m').eq(df.date_source1.dt.strftime('%Y-%m'))

# combo condition by the above three
cond = c2 & (c1 | c3)

df['date_final'] = np.where(cond, df.date_source2, df.date_source1)

>>> df
   id date_source1 date_source2 date_final
0   1   2017-11-01   2017-11-15 2017-11-15
1   2   2018-03-03          NaT 2018-03-03
2   3          NaT   2018-04-04 2018-04-04
3   4   2017-10-01   2017-09-01 2017-09-01
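For reuse, the same conditions can be wrapped in a function and checked against the question's sample table. This is a sketch under assumptions: pick_final_date is a hypothetical name, the empty cells are assumed to be None before parsing, and the same-month test compares dt.year and dt.month instead of formatted strings.

```python
import numpy as np
import pandas as pd


def pick_final_date(df):
    """Hypothetical helper: vectorized date_final following the question's rules."""
    d1, d2 = df["date_source1"], df["date_source2"]
    # Comparisons involving NaT evaluate to False, so rows with a missing
    # date never count as "earlier" or "same month".
    same_month = (d1.dt.year == d2.dt.year) & (d1.dt.month == d2.dt.month)
    use_d2 = d2.notna() & (d1.isna() | (d2 < d1) | same_month)
    return pd.Series(np.where(use_d2, d2, d1), index=df.index)


# Sample data from the question; None marks an empty cell (an assumption).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date_source1": ["1/11/2017", "3/3/2018", None, "1/10/2017"],
    "date_source2": ["15/11/2017", None, "4/4/2018", "1/9/2017"],
})
for col in ["date_source1", "date_source2"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

df["date_final"] = pick_final_date(df)
print(df)
```

Unlike df.apply with axis=1, this version works on whole columns at once, which usually scales better on large frames.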
