熊猫字符串以一列的值结尾,然后将字符串的开头转换为日期以进行比较

时间:2018-08-31 10:10:44

标签: python python-3.x pandas dataframe pandas-groupby

我有以下df

cluster_id    amount    inv_id            inv_date
1             309.9     07121830990       2018-07-12
1             309.9     07121830990       2018-07-12
2             3130.0    20180501313000B   2018-05-01
2             3130.0    20180501313000B   2018-05-01
3             3330.50   201804253330.50   2018-04-25             
3             3330.50   201804253330.50   2018-04-25
4             70.0      61518             2018-06-15
4             70.0      61518             2018-06-15
5             100.0     011318            2018-01-13
5             100.0     011318            2018-01-13
6             50.0      12202017          2017-12-20
6             50.0      12202017          2017-12-20
7             101.0     0000014482        2017-10-01
7             101.0     0000014482        2017-10-01

我想创建一个dummy_inv_id groupby的布尔列cluster_id,并为每个组将dummy_invoice_id设置为True

1. inv_id (stripped non-numerics) ends with amount and the remaining part of inv_id can be coerced into a valid date which is +/- 180 days of the inv_date

2. inv_id (stripped non-numerics) can be coerced into a date which is +/- 180 days of the inv_date 

首先,我将从inv_idgroupby cluster_id

中删除所有非数字字符
df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '')
grouped = df.groupby('cluster_id')

然后将amount * 100转换为字符串以方便匹配

df['amount'] = df['amount']*100
df['amt_str'] = df['amount'].apply(str) 

例如309.9'30990'3130.0'313000',在这里我想知道如何在此处检查inv_idamount结尾的inv_id检查datetime的其余部分是否可以转换为inv_date并且在inv_id的+/- 180天内,或者071218 - 2018-07-12 20180501 - 2018-05-01 61518 - 2018-06-15 12202017 - 2017-12-20 0000014482 - cannot be converted to date 是否可以直接转换为日期。尤其是一些日期格式,即

df

结果cluster_id amount inv_id inv_date dummy_inv_id 1 309.9 07121830990 2018-07-12 True 1 309.9 07121830990 2018-07-12 True 2 3130.0 20180501313000B 2018-05-01 True 2 3130.0 20180501313000B 2018-05-01 True 3 3330.50 201804253330.50 2018-04-25 True 3 3330.50 201804253330.50 2018-04-25 True 4 70.0 61518 2018-06-15 True 4 70.0 61518 2018-06-15 True 5 100.0 011318 2018-01-13 True 5 100.0 011318 2018-01-13 True 6 50.0 12202017 2017-12-20 True 6 50.0 12202017 2017-12-20 True 7 101.0 0000014482 2017-10-01 False 7 101.0 0000014482 2017-10-01 False 看起来像

function PresaberiSve(obj){
  $("td").removeClass("active");
  $(obj).closest("td").addClass("active");
}

1 个答案:

答案 0 :(得分:1)

Idea是使用日期时间的可能格式创建的辅助字典,其中包含用于切片和列表理解转换的字母数-errors='coerce'为不匹配的值创建NaT s

from functools import reduce

#add zeros to length 6
s = df.inv_id.str.replace(r'\D+', '').str.zfill(6)

formats = {'%m%d%y':6, 
           '%y%m%d':6,
           '%Y%m%d':8, 
           '%m%d%Y':8}
L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]

但是某些格式应转换为错误格式,因此这些超出范围的日期时间将转换为NaT

L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]

将所有非NaT值相加Series.combine_first

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
print (s2)
0    2018-07-12
1    2018-07-12
2    2018-05-01
3    2018-05-01
4    2018-04-25
5    2018-04-25
6    2018-06-15
7    2018-06-15
8    2018-01-13
9    2018-01-13
10   2017-12-20
11   2017-12-20
12          NaT
13          NaT
Name: inv_id, dtype: datetime64[ns]

最后检查+-180天:

df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))

print (df)
    cluster_id  amount           inv_id    inv_date    new
0            1   309.9      07121830990  2018-07-12   True
1            1   309.9      07121830990  2018-07-12   True
2            2  3130.0  20180501313000B  2018-05-01   True
3            2  3130.0  20180501313000B  2018-05-01   True
4            3  3330.5  201804253330.50  2018-04-25   True
5            3  3330.5  201804253330.50  2018-04-25   True
6            4    70.0            61518  2018-06-15   True
7            4    70.0            61518  2018-06-15   True
8            5   100.0           011318  2018-01-13   True
9            5   100.0           011318  2018-01-13   True
10           6    50.0         12202017  2017-12-20   True
11           6    50.0         12202017  2017-12-20   True
12           7   101.0       0000014482  2017-10-01  False
13           7   101.0       0000014482  2017-10-01  False

编辑:

添加了从末端删除子字符串的解决方案:

import re
from functools import reduce

df['amt_str'] = (df['amount']*100).round().astype(int).astype(str) 
df['inv_str'] = df.inv_id.str.replace(r'\D+', '').str.zfill(6)

#https://stackoverflow.com/a/1038845/2901002
df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
print (df)
    cluster_id  amount           inv_id    inv_date amt_str     inv_str
0            1   309.9      07121830990  2018-07-12   30990      071218
1            1   309.9      07121830990  2018-07-12   30990      071218
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425
6            4    70.0            61518  2018-06-15    7000      061518
7            4    70.0            61518  2018-06-15    7000      061518
8            5   100.0           011318  2018-01-13   10000      011318
9            5   100.0           011318  2018-01-13   10000      011318
10           6    50.0         12202017  2017-12-20    5000    12202017
11           6    50.0         12202017  2017-12-20    5000    12202017
12           7   101.0       0000014482  2017-10-01   10100  0000014482
13           7   101.0       0000014482  2017-10-01   10100  0000014482

formats = {'%m%d%y':6, 
           '%y%m%d':6,
           '%Y%m%d':8, 
           '%m%d%Y':8}
L=[pd.to_datetime(df['inv_str'].str[:v],format=k, errors='coerce') for k,v in formats.items()]
L = [x.where(x.between('2000-01-01', pd.datetime.now())) for x in L]

s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
df['new'] = s2.between(s2 - pd.Timedelta(180, unit='d'), s2 + pd.Timedelta(180, unit='d'))
print (df)
    cluster_id  amount           inv_id    inv_date amt_str     inv_str    new
0            1   309.9      07121830990  2018-07-12   30990      071218   True
1            1   309.9      07121830990  2018-07-12   30990      071218   True
2            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
3            2  3130.0  20180501313000B  2018-05-01  313000    20180501   True
4            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
5            3  3330.5  201804253330.50  2018-04-25  333050    20180425   True
6            4    70.0            61518  2018-06-15    7000      061518   True
7            4    70.0            61518  2018-06-15    7000      061518   True
8            5   100.0           011318  2018-01-13   10000      011318   True
9            5   100.0           011318  2018-01-13   10000      011318   True
10           6    50.0         12202017  2017-12-20    5000    12202017   True
11           6    50.0         12202017  2017-12-20    5000    12202017   True
12           7   101.0       0000014482  2017-10-01   10100  0000014482  False
13           7   101.0       0000014482  2017-10-01   10100  0000014482  False