Slice the rows of a selected column by group

Time: 2017-11-10 06:21:40

Tags: python pandas

I have a DataFrame

id   date      num
A    20170301  1
A    20170302  4
A    20170304  2
C    20170302  3
B    20170303  2
C    20170305  0
B    20170304  5
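For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: it assumes the dates are stored as strings, which the question does not state explicitly.

```python
import pandas as pd

# Sample data copied from the question.
# Assumption: the date column holds strings such as '20170301'.
df = pd.DataFrame({
    'id':   ['A', 'A', 'A', 'C', 'B', 'C', 'B'],
    'date': ['20170301', '20170302', '20170304', '20170302',
             '20170303', '20170305', '20170304'],
    'num':  [1, 4, 2, 3, 2, 0, 5],
})
print(df)
```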

How can I get

date_A_list = ['20170301','20170302','20170304']
date_B_list = ['20170303','20170304']
date_C_list = ['20170302','20170305']
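One possible way to collect such per-id lists is to group by id and turn each group's dates into a list. This is a sketch only; the name date_lists is made up for illustration, and the dates are again assumed to be strings.

```python
import pandas as pd

# Sample data from the question (dates assumed to be strings)
df = pd.DataFrame({
    'id':   ['A', 'A', 'A', 'C', 'B', 'C', 'B'],
    'date': ['20170301', '20170302', '20170304', '20170302',
             '20170303', '20170305', '20170304'],
    'num':  [1, 4, 2, 3, 2, 0, 5],
})

# One list of dates per id, stored in a dict keyed by id
# (date_lists is a hypothetical name, not from the question)
date_lists = df.groupby('id')['date'].apply(list).to_dict()

date_A_list = date_lists['A']
date_B_list = date_lists['B']
date_C_list = date_lists['C']
print(date_lists)
```

A dict keyed by id tends to scale better than one variable per id once there are more than a handful of groups.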

so that each list can be checked against

rng = pd.date_range(df.date.min(),df.date.max())

When I pass in

diff = rng.difference(date_A_list)

I get

diff = ['20170303']
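Note that with the sample data a range spanning the whole frame runs from 2017-03-01 to 2017-03-05, so the result above only comes out if the range is limited to group A's own first and last date. The sketch below compares the two; the names global_rng and group_rng are illustrative, and the strings are converted to datetimes explicitly so the comparison does not rely on implicit casting.

```python
import pandas as pd

date_A_list = ['20170301', '20170302', '20170304']
dates_a = pd.to_datetime(date_A_list, format='%Y%m%d')

# Range over the whole frame (min and max of all dates in the sample data)
global_rng = pd.date_range('2017-03-01', '2017-03-05')
global_diff = global_rng.difference(dates_a).strftime('%Y%m%d').tolist()
print(global_diff)  # also reports 20170305, which lies after A's last date

# Range limited to group A's own first and last date
group_rng = pd.date_range(dates_a.min(), dates_a.max())
diff = group_rng.difference(dates_a).strftime('%Y%m%d').tolist()
print(diff)
```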

so that my DataFrame ends up looking like this

id   date      num  num_sum    len_diff
A    20170301  1    7          1
A    20170302  4    7          1
A    20170304  2    7          1
B    20170303  2    7          0
B    20170304  5    7          0
C    20170302  3    3          2
C    20170305  0    3          2

1 Answer:

Answer 0: (score: 2)

Use sort_values together with GroupBy.transform and map:

import pandas as pd

# convert the date column to datetimes
df.date = pd.to_datetime(df.date, format='%Y%m%d')

# group by id and resample by day; dates missing within a group show up as NaN
d1 = df.set_index('date').groupby('id').resample('D')['id'].first()
print (d1)
id  date      
A   2017-03-01      A
    2017-03-02      A
    2017-03-03    NaN
    2017-03-04      A
B   2017-03-03      B
    2017-03-04      B
C   2017-03-02      C
    2017-03-03    NaN
    2017-03-04    NaN
    2017-03-05      C
Name: id, dtype: object

# count NaNs per id (Series.sum(level=0) was removed in pandas 2.0)
s = d1.isnull().groupby(level=0).sum().astype(int)
print (s)
id
A    1
B    0
C    2
Name: id, dtype: int32

# sort by id, add the per-group sum of num, then map the NaN counts back by id
df = df.sort_values('id')
df['num_sum'] = df.groupby('id')['num'].transform('sum')
df['len_diff'] = df['id'].map(s)
print (df)
  id       date  num  num_sum  len_diff
0  A 2017-03-01    1        7         1
1  A 2017-03-02    4        7         1
2  A 2017-03-04    2        7         1
4  B 2017-03-03    2        7         0
6  B 2017-03-04    5        7         0
3  C 2017-03-02    3        3         2
5  C 2017-03-05    0        3         2

Another solution, using a custom function:

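# convert the date column to datetimes
df.date = pd.to_datetime(df.date, format='%Y%m%d')

def f(x):
    # number of calendar days missing between this group's first and last date
    rng = pd.date_range(x.min(), x.max())
    return len(rng.difference(x))

df = df.sort_values('id')
df['num_sum'] = df.groupby('id')['num'].transform('sum')
# transform broadcasts the scalar returned by f to every row of its group
df['len_diff'] = df.groupby('id')['date'].transform(f)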
print (df)
  id       date  num  num_sum  len_diff
0  A 2017-03-01    1        7         1
1  A 2017-03-02    4        7         1
2  A 2017-03-04    2        7         1
4  B 2017-03-03    2        7         0
6  B 2017-03-04    5        7         0
3  C 2017-03-02    3        3         2
5  C 2017-03-05    0        3         2
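As a cross-check on the two solutions above, the same count can also be derived with plain arithmetic: the number of calendar days spanned by a group minus the number of distinct dates it contains. This shortcut is not part of the original answer, and span_days is a name introduced here for illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'id':   ['A', 'A', 'A', 'C', 'B', 'C', 'B'],
    'date': ['20170301', '20170302', '20170304', '20170302',
             '20170303', '20170305', '20170304'],
    'num':  [1, 4, 2, 3, 2, 0, 5],
})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

g = df.groupby('id')['date']

# Calendar days from each group's first to last date, inclusive
span_days = (g.transform('max') - g.transform('min')).dt.days + 1

# Missing days = days spanned minus distinct dates actually present
df['len_diff'] = span_days - g.transform('nunique')
df['num_sum'] = df.groupby('id')['num'].transform('sum')

df = df.sort_values('id')
print(df)
```

This avoids building a date range per group, which may matter on large frames, but it only yields the count, not the missing dates themselves.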