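I have the following data frame, which contains data for the same companies (column ID) on different dates (column date). I want to drop the companies that have fewer than 3 consecutive days of observations.
The starting dataset is: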
import pandas as pd

df = pd.DataFrame({"ID":{"0":1,"1":1,"2":1,"3":1,"4":4,"5":4,"6":4,"7":2,"8":2,"9":3,"10":3},
"date":{"0":1421020800000,"1":1421193600000,"2":1422489600000,"3":1423353600000,"4":1421020800000,"5":1421107200000,"6":1421193600000,"7":1421020800000,"8":1421107200000,"9":1421452800000,"10":1421539200000},
"variable":{"0":28,"1":62,"2":60,"3":72,"4":28,"5":61,"6":62,"7":23,"8":70,"9":32,"10":55}})
df.date = pd.to_datetime(df.date, unit='ms')
df.sort_values(by=["ID", "date"],inplace=True)
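In the data frame above, only the company with ID = 4 meets the requirement, and I want to drop the others.
I wrote the following code, but it has an obvious problem that I don't know how to fix: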
df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    if group.shape[0] < 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]
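The code above keeps two companies, ID = 1 and ID = 4. ID = 1 should be dropped: it has 4 data points, but at most two of them fall on consecutive days (whereas I want to require at least 3).
Any help would be much appreciated. Thanks.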
Answer 0 (score: 0)
IIUC, use diff + cumsum on the date column to create a group key New; then groupby + filter drops the unwanted groups.
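# Label each run of consecutive days within a company.
# group_keys=False keeps the original row index so the result aligns with df.
df['New']=df.groupby('ID', group_keys=False).date.apply(lambda x : x.diff().dt.days.ne(1).cumsum())
# Keep only the runs that contain at least 3 rows.
yourdf=df.groupby(['ID','New']).filter(lambda x : len(x)>=3)
yourdf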
Out[809]:
   ID       date  variable  New
4   4 2015-01-12        28    1
5   4 2015-01-13        61    1
6   4 2015-01-14        62    1
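To make the diff + cumsum step easier to follow, here is a minimal sketch of the intermediate values for a single company. The five dates are a made-up sample (not the asker's data), and the names gap_days, starts_new_run, run_id and run_length are only illustrative.

```python
import pandas as pd

# Hypothetical sample: three consecutive days, a gap, then two consecutive days.
dates = pd.Series(pd.to_datetime(
    ["2015-01-12", "2015-01-13", "2015-01-14", "2015-01-20", "2015-01-21"]
))

# Gap to the previous row in whole days; the first row has no predecessor (NaN).
gap_days = dates.diff().dt.days

# A new run starts wherever the gap is not exactly one day (NaN counts as "not 1").
starts_new_run = gap_days.ne(1)

# The cumulative sum of the run starts gives every row a run label.
run_id = starts_new_run.cumsum()

# Number of rows in the run each row belongs to.
run_length = run_id.map(run_id.value_counts())

print(pd.DataFrame({"date": dates, "gap_days": gap_days,
                    "run_id": run_id, "run_length": run_length}))
```

Rows whose run_length is below 3 are the ones that a filter on the run label would drop.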
Answer 1 (score: 0)
I think you can replace "group.shape[0]" with a count of the rows that fall inside a 3-day rolling window.
df = pd.DataFrame({"ID":{"0":1,"1":1,"2":1,"3":1,"4":4,"5":4,"6":4,"7":2,"8":2,"9":3,"10":3},
"date":{"0":1421020800000,"1":1421193600000,"2":1422489600000,"3":1423353600000,"4":1421020800000,"5":1421107200000,"6":1421193600000,"7":1421020800000,"8":1421107200000,"9":1421452800000,"10":1421539200000},
"variable":{"0":28,"1":62,"2":60,"3":72,"4":28,"5":61,"6":62,"7":23,"8":70,"9":32,"10":55}})
df.date = pd.to_datetime(df.date, unit='ms')
df.sort_values(by=["ID", "date"],inplace=True)
df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    # Index by date so that rolling() can use a time-based window.
    group = group.set_index('date')
    # Largest number of rows found inside any 3-day window.
    if group.rolling(window='3D',min_periods=0).count()['delete'].max() < 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]
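As a rough illustration of what the 3-day window counts, the sketch below applies the same idea to one company at a time. The helper name max_rows_in_window is hypothetical, and it assumes at most one row per company per day; with duplicate dates, three rows in a window would no longer mean three distinct days.

```python
import pandas as pd

def max_rows_in_window(dates, window="3D"):
    """Return the largest number of rows that fall inside any time window."""
    # A series of ones indexed by the sorted dates; rolling needs a monotonic index.
    ones = pd.Series(1, index=pd.DatetimeIndex(dates).sort_values())
    # For each row, count the rows in the window that ends at that row's date.
    return int(ones.rolling(window).count().max())

# Dates of company ID = 1 from the question: no three rows within three days.
company_1 = ["2015-01-12", "2015-01-14", "2015-01-29", "2015-02-08"]
# Dates of company ID = 4 from the question: three consecutive days.
company_4 = ["2015-01-12", "2015-01-13", "2015-01-14"]

print(max_rows_in_window(company_1))
print(max_rows_in_window(company_4))
```

A company is kept only when this value reaches 3, which is the condition tested inside the loop.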
Answer 2 (score: 0)
df['delete'] = 0
for name, group in df.groupby(by = "ID"):
    if group.shape[0] != 3:
        df.loc[df['ID']==name,'delete'] = 1
df = df.loc[df['delete'] == 0,:]
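You may have set the condition wrong; try if group.shape[0] != 3. Note that this keeps only the companies with exactly three rows and does not check whether the days are consecutive.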
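Because a plain row count cannot tell whether the days are consecutive, one loop-free possibility is to measure the longest run of consecutive days per company and keep every row of the companies that reach 3. This is only a sketch on the question's data; the names new_run, run_id, run_len, longest and kept are illustrative, and it assumes one row per company per day.

```python
import pandas as pd

# The question's data, written as plain lists (epoch milliseconds for the dates).
df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 4, 4, 4, 2, 2, 3, 3],
    "date": pd.to_datetime(
        [1421020800000, 1421193600000, 1422489600000, 1423353600000,
         1421020800000, 1421107200000, 1421193600000,
         1421020800000, 1421107200000,
         1421452800000, 1421539200000], unit="ms"),
    "variable": [28, 62, 60, 72, 28, 61, 62, 23, 70, 32, 55],
})
df = df.sort_values(["ID", "date"])

# True where a row does not directly follow the previous day of the same company.
new_run = df.groupby("ID")["date"].diff().dt.days.ne(1)

# Run label per row; each company's first row always starts a new run.
run_id = new_run.cumsum()

# Number of rows in each run, broadcast back to the rows.
run_len = df.groupby(["ID", run_id])["date"].transform("count")

# Longest run of each company, broadcast back to the rows.
longest = run_len.groupby(df["ID"]).transform("max")

# Keep all rows of companies with at least 3 consecutive days.
kept = df[longest >= 3]
print(kept)
```

Unlike filtering on the run label, this keeps every row of a qualifying company, including rows that lie outside its long run.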