I have the following df:
cluster_id amount inv_id inv_date
1 309.9 07121830990 2018-07-12
1 309.9 07121830990 2018-07-12
2 3130.0 20180501313000B 2018-05-01
2 3130.0 20180501313000B 2018-05-01
3 3330.50 201804253330.50 2018-04-25
3 3330.50 201804253330.50 2018-04-25
4 70.0 61518 2018-06-15
4 70.0 61518 2018-06-15
5 100.0 011318 2018-01-13
5 100.0 011318 2018-01-13
6 50.0 12202017 2017-12-20
6 50.0 12202017 2017-12-20
7 101.0 0000014482 2017-10-01
7 101.0 0000014482 2017-10-01
I want to create a boolean column dummy_inv_id, working per cluster_id group (groupby), and set dummy_inv_id to True for each group if either:
1. inv_id (stripped non-numerics) ends with amount and the remaining part of inv_id can be coerced into a valid date which is +/- 180 days of the inv_date
or
2. inv_id (stripped non-numerics) can be coerced into a date which is +/- 180 days of the inv_date
First, I strip the non-numeric characters from inv_id and group by cluster_id:
df['inv_id_stp'] = df.inv_id.str.replace(r'\D+', '', regex=True)
grouped = df.groupby('cluster_id')
Then I multiply amount by 100 and convert it to a string to make matching easier:
df['amount'] = df['amount']*100
df['amt_str'] = df['amount'].apply(str)
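For example, 309.9 becomes '30990' and 3130.0 becomes '313000'. Here I would like to know how to check whether inv_id ends with amount, and then whether the remaining part of inv_id can be converted to a datetime that is within +/- 180 days of inv_date, or whether inv_id can be converted to a date directly. In particular, there are several date formats, i.e.
071218 - 2018-07-12
20180501 - 2018-05-01
61518 - 2018-06-15
12202017 - 2017-12-20
0000014482 - cannot be converted to date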
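The format list above can be tried on a single digit string with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper name `candidate_dates` and the `FORMATS` mapping are hypothetical, and it assumes the four layouts (MMDDYY, YYMMDD, YYYYMMDD, MMDDYYYY) are the only ones present.

```python
from datetime import datetime

# Hypothetical mapping: strptime format -> number of leading digits to read.
FORMATS = {'%m%d%y': 6, '%y%m%d': 6, '%Y%m%d': 8, '%m%d%Y': 8}

def candidate_dates(digits):
    """Return every date that the leading digits can be parsed into."""
    digits = digits.zfill(6)  # '61518' -> '061518'
    found = []
    for fmt, width in FORMATS.items():
        chunk = digits[:width]
        if len(chunk) < width:
            continue  # not enough digits for this layout
        try:
            found.append(datetime.strptime(chunk, fmt))
        except ValueError:
            continue  # this layout does not fit
    return found

print(candidate_dates('61518'))
print(candidate_dates('12202017'))
print(candidate_dates('0000014482'))
```

A string can fit more than one layout: '12202017' gives 2017-12-20 under '%m%d%Y', but its first six digits also read as 2020-12-20 under '%m%d%y'. That ambiguity is why the comparison against inv_date matters.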
The resulting df would look like:
cluster_id amount inv_id inv_date dummy_inv_id
1 309.9 07121830990 2018-07-12 True
1 309.9 07121830990 2018-07-12 True
2 3130.0 20180501313000B 2018-05-01 True
2 3130.0 20180501313000B 2018-05-01 True
3 3330.50 201804253330.50 2018-04-25 True
3 3330.50 201804253330.50 2018-04-25 True
4 70.0 61518 2018-06-15 True
4 70.0 61518 2018-06-15 True
5 100.0 011318 2018-01-13 True
5 100.0 011318 2018-01-13 True
6 50.0 12202017 2017-12-20 True
6 50.0 12202017 2017-12-20 True
7 101.0 0000014482 2017-10-01 False
7 101.0 0000014482 2017-10-01 False
Answer 0 (score: 1)
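The idea is to build a helper dictionary of the possible datetime formats, mapping each format to the number of characters to slice, and to convert in a list comprehension; errors='coerce' produces NaT for values that do not match: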
import pandas as pd
from functools import reduce
# strip non-digits, then left-pad with zeros to a length of at least 6
s = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
formats = {'%m%d%y':6,
'%y%m%d':6,
'%Y%m%d':8,
'%m%d%Y':8}
L = [pd.to_datetime(s.str[:v], format=k, errors='coerce') for k,v in formats.items()]
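To see what errors='coerce' does in isolation, here is a small sketch with made-up sample strings (the `samples` values are illustrative, not taken from the df above). Only the first one fits the '%m%d%y' layout.

```python
import pandas as pd

# Illustrative inputs: a valid MMDDYY string, an invalid month, and non-digits.
samples = pd.Series(['071218', '201805', 'abc'])

# Values that do not fit the format become NaT instead of raising an error.
parsed = pd.to_datetime(samples, format='%m%d%y', errors='coerce')
print(parsed)
```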
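However, some formats parse a string into the wrong date, so datetimes outside a plausible range are converted to NaT:
# keep only dates between 2000-01-01 and the current time; the rest become NaT
L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]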
Combine the candidates, taking the first non-NaT value for each row, with Series.combine_first:
s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
print (s2)
0 2018-07-12
1 2018-07-12
2 2018-05-01
3 2018-05-01
4 2018-04-25
5 2018-04-25
6 2018-06-15
7 2018-06-15
8 2018-01-13
9 2018-01-13
10 2017-12-20
11 2017-12-20
12 NaT
13 NaT
Name: inv_id, dtype: datetime64[ns]
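As a small illustration of how combine_first fills gaps, consider two hypothetical series `a` and `b` that each parsed a different row. Values from `a` win wherever they are present, and `b` only fills the NaT slots.

```python
import pandas as pd

# Two illustrative parse results; each one succeeded on a different row.
a = pd.Series(pd.to_datetime(['2018-07-12', None, None]))
b = pd.Series(pd.to_datetime([None, '2018-05-01', None]))

# Take a's value where present, otherwise fall back to b's value.
merged = a.combine_first(b)
print(merged)
```

Because earlier entries take priority, the order of the formats dictionary decides which date is kept when a string fits more than one layout.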
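Finally, check that the parsed date is within +/- 180 days of inv_date:
inv_date = pd.to_datetime(df['inv_date'])
df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))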
print (df)
cluster_id amount inv_id inv_date new
0 1 309.9 07121830990 2018-07-12 True
1 1 309.9 07121830990 2018-07-12 True
2 2 3130.0 20180501313000B 2018-05-01 True
3 2 3130.0 20180501313000B 2018-05-01 True
4 3 3330.5 201804253330.50 2018-04-25 True
5 3 3330.5 201804253330.50 2018-04-25 True
6 4 70.0 61518 2018-06-15 True
7 4 70.0 61518 2018-06-15 True
8 5 100.0 011318 2018-01-13 True
9 5 100.0 011318 2018-01-13 True
10 6 50.0 12202017 2017-12-20 True
11 6 50.0 12202017 2017-12-20 True
12 7 101.0 0000014482 2017-10-01 False
13 7 101.0 0000014482 2017-10-01 False
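The plausibility filter between '2000-01-01' and the current time depends on when the code runs: the first six digits of '12202017' parse as 2020-12-20 under '%m%d%y', which falls inside that range once that date has passed. The sketch below is a variant, not the answer's method. It tests each candidate directly against inv_date and then spreads the result over each cluster_id group. The names `row_match` and `window` are assumptions, and the frame keeps one row per cluster for brevity.

```python
import pandas as pd

df = pd.DataFrame({
    'cluster_id': [1, 2, 3, 4, 5, 6, 7],
    'inv_id': ['07121830990', '20180501313000B', '201804253330.50',
               '61518', '011318', '12202017', '0000014482'],
    'inv_date': pd.to_datetime(['2018-07-12', '2018-05-01', '2018-04-25',
                                '2018-06-15', '2018-01-13', '2017-12-20',
                                '2017-10-01']),
})

formats = {'%m%d%y': 6, '%y%m%d': 6, '%Y%m%d': 8, '%m%d%Y': 8}
window = pd.Timedelta(days=180)

# Keep digits only and left-pad short ids such as '61518'.
digits = df['inv_id'].str.replace(r'\D+', '', regex=True).str.zfill(6)

# A row matches if any layout yields a date within the window of inv_date.
row_match = pd.Series(False, index=df.index)
for fmt, width in formats.items():
    parsed = pd.to_datetime(digits.str[:width], format=fmt, errors='coerce')
    # NaT differences compare as False, so unparseable rows never match.
    row_match = row_match | ((parsed - df['inv_date']).abs() <= window)

# Mark the whole cluster True if any of its rows matched.
df['dummy_inv_id'] = row_match.groupby(df['cluster_id']).transform('any')
print(df)
```

With this approach no fixed date range is needed, since a wrongly parsed candidate is rejected unless it happens to land near inv_date.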
Edit:
Added a solution that removes the amount substring from the end:
import re
from functools import reduce
df['amt_str'] = (df['amount']*100).round().astype(int).astype(str)
df['inv_str'] = df.inv_id.str.replace(r'\D+', '', regex=True).str.zfill(6)
#https://stackoverflow.com/a/1038845/2901002
df['inv_str'] = df.apply(lambda x: re.sub('{}$'.format(x['amt_str']),'', x['inv_str']),axis=1)
print (df)
cluster_id amount inv_id inv_date amt_str inv_str
0 1 309.9 07121830990 2018-07-12 30990 071218
1 1 309.9 07121830990 2018-07-12 30990 071218
2 2 3130.0 20180501313000B 2018-05-01 313000 20180501
3 2 3130.0 20180501313000B 2018-05-01 313000 20180501
4 3 3330.5 201804253330.50 2018-04-25 333050 20180425
5 3 3330.5 201804253330.50 2018-04-25 333050 20180425
6 4 70.0 61518 2018-06-15 7000 061518
7 4 70.0 61518 2018-06-15 7000 061518
8 5 100.0 011318 2018-01-13 10000 011318
9 5 100.0 011318 2018-01-13 10000 011318
10 6 50.0 12202017 2017-12-20 5000 12202017
11 6 50.0 12202017 2017-12-20 5000 12202017
12 7 101.0 0000014482 2017-10-01 10100 0000014482
13 7 101.0 0000014482 2017-10-01 10100 0000014482
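The same two steps, turning the amount into a digit string and dropping it from the end of the id, can also be written without a regular expression. This is a minimal sketch with hypothetical helper names (`amount_to_digits`, `strip_amount_suffix`). Rounding before the integer conversion guards against floating-point noise in amount * 100, and plain str() on a float would keep a trailing '.0'.

```python
def amount_to_digits(amount):
    """Return the amount in hundredths as a digit string, e.g. 309.9 -> '30990'."""
    return str(int(round(amount * 100)))

def strip_amount_suffix(inv_digits, amt_digits):
    """Drop amt_digits from the end of inv_digits if it is a suffix."""
    if amt_digits and inv_digits.endswith(amt_digits):
        return inv_digits[:-len(amt_digits)]
    return inv_digits  # no match: leave the id unchanged

print(strip_amount_suffix('07121830990', amount_to_digits(309.9)))
print(strip_amount_suffix('061518', amount_to_digits(70.0)))
```

With pandas, these helpers could be applied row by row through df.apply(..., axis=1) in the same way as the re.sub call above.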
formats = {'%m%d%y':6,
'%y%m%d':6,
'%Y%m%d':8,
'%m%d%Y':8}
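L = [pd.to_datetime(df['inv_str'].str[:v], format=k, errors='coerce') for k,v in formats.items()]
# keep only dates between 2000-01-01 and the current time; the rest become NaT
L = [x.where(x.between('2000-01-01', pd.Timestamp.now())) for x in L]
s2 = reduce(lambda l,r: pd.Series.combine_first(l,r), L)
# compare the parsed date with inv_date, not with itself
inv_date = pd.to_datetime(df['inv_date'])
df['new'] = s2.between(inv_date - pd.Timedelta(180, unit='d'), inv_date + pd.Timedelta(180, unit='d'))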
print (df)
cluster_id amount inv_id inv_date amt_str inv_str new
0 1 309.9 07121830990 2018-07-12 30990 071218 True
1 1 309.9 07121830990 2018-07-12 30990 071218 True
2 2 3130.0 20180501313000B 2018-05-01 313000 20180501 True
3 2 3130.0 20180501313000B 2018-05-01 313000 20180501 True
4 3 3330.5 201804253330.50 2018-04-25 333050 20180425 True
5 3 3330.5 201804253330.50 2018-04-25 333050 20180425 True
6 4 70.0 61518 2018-06-15 7000 061518 True
7 4 70.0 61518 2018-06-15 7000 061518 True
8 5 100.0 011318 2018-01-13 10000 011318 True
9 5 100.0 011318 2018-01-13 10000 011318 True
10 6 50.0 12202017 2017-12-20 5000 12202017 True
11 6 50.0 12202017 2017-12-20 5000 12202017 True
12 7 101.0 0000014482 2017-10-01 10100 0000014482 False
13 7 101.0 0000014482 2017-10-01 10100 0000014482 False