I have a DataFrame:
id date num
A 20170301 1
A 20170302 4
A 20170304 2
C 20170302 3
B 20170303 2
C 20170305 0
B 20170304 5
How can I get
date_A_list = ['20170301','20170302','20170304']
date_B_list = ['20170303','20170304']
date_C_list = ['20170302','20170305']
so that I can check each list against the full date range
rng = pd.date_range(df.date.min(),df.date.max())
and when I run
diff = rng.difference(date_A_list)
I get
diff = ['20170303']
and the DataFrame I want looks like this:
id date num num_sum len_diff
A 20170301 1 7 1
A 20170302 4 7 1
A 20170304 2 7 1
B 20170303 2 7 0
B 20170304 5 7 0
C 20170302 3 3 2
C 20170305 0 3 2
Answer 0 (score: 2)
Use sort_values together with GroupBy.transform and map:
import pandas as pd

#convert to datetimes
df.date = pd.to_datetime(df.date, format='%Y%m%d')
#groupby + resample by days - get NaNs for missing dates
d1 = df.set_index('date').groupby('id').resample('D')['id'].first()
print (d1)
id  date
A   2017-03-01      A
    2017-03-02      A
    2017-03-03    NaN
    2017-03-04      A
B   2017-03-03      B
    2017-03-04      B
C   2017-03-02      C
    2017-03-03    NaN
    2017-03-04    NaN
    2017-03-05      C
Name: id, dtype: object
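#count NaNs per id (Series.sum(level=0) was removed in pandas 2.0, so group by the index level instead)
s = d1.isnull().groupby(level=0).sum().astype(int)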
print (s)
id
A 1
B 0
C 2
Name: id, dtype: int32
df = df.sort_values('id')
df['num_sum'] = df.groupby('id')['num'].transform('sum')
df['len_diff'] = df['id'].map(s)
print (df)
id date num num_sum len_diff
0 A 2017-03-01 1 7 1
1 A 2017-03-02 4 7 1
2 A 2017-03-04 2 7 1
4 B 2017-03-03 2 7 0
6 B 2017-03-04 5 7 0
3 C 2017-03-02 3 3 2
5 C 2017-03-05 0 3 2
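As a cross-check on the counts above, the number of missing days per id can also be derived by plain arithmetic: the length of the span from the first to the last date, minus the number of distinct dates actually present. This is only a sketch under that assumption, not part of the original answer; the column name `len_diff_check` is made up for illustration, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "C", "B", "C", "B"],
    "date": ["20170301", "20170302", "20170304", "20170302",
             "20170303", "20170305", "20170304"],
    "num":  [1, 4, 2, 3, 2, 0, 5],
})
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

dates = df.groupby("id")["date"]

# Days covered by each id, counting both endpoints.
span_days = (dates.transform("max") - dates.transform("min")).dt.days + 1

# Missing days = days in the span minus distinct days present.
# `len_diff_check` is a hypothetical column name used only here.
df["len_diff_check"] = span_days - dates.transform("nunique")

print(df.sort_values("id"))
```

For the sample data this should agree with `len_diff` above (1 for A, 0 for B, 2 for C), and it avoids building the resampled series at all.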
Another solution, with a custom function:
df.date = pd.to_datetime(df.date, format='%Y%m%d')
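def f(x):
    #full daily range between the first and last date of the group
    rng = pd.date_range(x.min(), x.max())
    #number of days in the range that are absent from the group
    return len(rng.difference(x))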
df = df.sort_values('id')
df['num_sum'] = df.groupby('id')['num'].transform('sum')
df['len_diff'] = df.groupby('id')['date'].transform(f)
print (df)
id date num num_sum len_diff
0 A 2017-03-01 1 7 1
1 A 2017-03-02 4 7 1
2 A 2017-03-04 2 7 1
4 B 2017-03-03 2 7 0
6 B 2017-03-04 5 7 0
3 C 2017-03-02 3 3 2
5 C 2017-03-05 0 3 2
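Neither solution builds the per-id lists the question literally asks for (`date_A_list` and so on). If those are still wanted, one possible approach is a dict keyed by id rather than one variable per id. This is a rough sketch, not something from the answer above: the names `date_lists` and `missing` are made up for illustration, and the sample data is copied from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "C", "B", "C", "B"],
    "date": ["20170301", "20170302", "20170304", "20170302",
             "20170303", "20170305", "20170304"],
    "num":  [1, 4, 2, 3, 2, 0, 5],
})
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")

date_lists = {}  # id -> dates present, as 'YYYYMMDD' strings
missing = {}     # id -> dates absent between that id's first and last date

# Iterating a grouped column yields (group key, Series of that group's dates).
for key, dates in df.groupby("id")["date"]:
    date_lists[key] = dates.dt.strftime("%Y%m%d").tolist()
    full = pd.date_range(dates.min(), dates.max(), freq="D")
    gaps = full.difference(pd.DatetimeIndex(dates))
    missing[key] = [d.strftime("%Y%m%d") for d in gaps]

print(date_lists)
print(missing)
```

Note that each id is compared against its own first-to-last range, as in the custom function `f`, not against the global `df.date.min()` to `df.date.max()` range from the question.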