Question

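I am trying to implement the following condition: if the count of incorrect values on a date is greater than 2 (2019-05-17 and 2019-05-20 in the example below), the complete date (all of its time slots) should be deleted.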

Input

                    t_value C/IC
2019-05-17 00:00:00   0     incorrect
2019-05-17 01:00:00   0     incorrect 
2019-05-17 02:00:00   0     incorrect 
2019-05-17 03:00:00   4     correct
2019-05-17 04:00:00   5     correct 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct
2019-05-20 07:00:00   2    correct
2019-05-20 08:00:00   0    incorrect
2019-05-20 09:00:00   0    incorrect
2019-05-20 07:00:00   0    incorrect

Desired output

                    t_value C/IC 
2019-05-18 01:00:00   0     incorrect   
2019-05-18 02:00:00   6     correct  
2019-05-18 03:00:00   7     correct 
2019-05-19 04:00:00   0     incorrect
2019-05-19 09:00:00   0    incorrect 
2019-05-19 11:00:00   8    correct

I am not sure which time-based operation to use to get the desired result. Thanks.

Answer 1

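import numpy as np
import pandas as pd
from io import StringIO

#read in data (`data` is a string holding the input table shown in the question)
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python')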

#give index a name 
df.index.name = 'Date'
#convert to datetime 
#and sort index
#usually safer to sort datetime index in Pandas
df.index = pd.to_datetime(df.index)
df = df.sort_index()

res = (df
       #group by date and c/ic
       .groupby([pd.Grouper(freq='1D',level='Date'),"C/IC"])
       .size()
       #get rows greater than 2 and incorrect
       .loc[lambda x: x>2,"incorrect"]
       #keep only the date index
       .droplevel(-1)
       .index
       #datetime information trapped here
       #and due to grouping, it is different from initial datetime
       #as such, we convert to string 
       #and build another batch of dates
       .astype(str)
       .tolist()
      )

res
['2019-05-17', '2019-05-20']

#build a numpy array of dates
idx = np.array(res, dtype='datetime64')

#exclude dates in idx and get final value
#aim is to get dates, irrespective of time

df.loc[~np.isin(df.index.date,idx)]

                     t_value    C/IC
Date        
2019-05-18 01:00:00     0   incorrect
2019-05-18 02:00:00     6   correct
2019-05-18 03:00:00     7   correct
2019-05-19 04:00:00     0   incorrect
2019-05-19 09:00:00     0   incorrect
2019-05-19 11:00:00     8   correct
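
For comparison, the same filtering can be sketched more compactly with groupby and transform. This is only a minimal sketch, not the method used above: it assumes the question's sample rows are rebuilt by hand instead of parsed with read_csv, and the names rows, day, incorrect_per_day and result are illustrative.

```python
import pandas as pd

# Sample rows copied from the question's input table.
rows = [
    ("2019-05-17 00:00:00", 0, "incorrect"),
    ("2019-05-17 01:00:00", 0, "incorrect"),
    ("2019-05-17 02:00:00", 0, "incorrect"),
    ("2019-05-17 03:00:00", 4, "correct"),
    ("2019-05-17 04:00:00", 5, "correct"),
    ("2019-05-18 01:00:00", 0, "incorrect"),
    ("2019-05-18 02:00:00", 6, "correct"),
    ("2019-05-18 03:00:00", 7, "correct"),
    ("2019-05-19 04:00:00", 0, "incorrect"),
    ("2019-05-19 09:00:00", 0, "incorrect"),
    ("2019-05-19 11:00:00", 8, "correct"),
    ("2019-05-20 07:00:00", 2, "correct"),
    ("2019-05-20 08:00:00", 0, "incorrect"),
    ("2019-05-20 09:00:00", 0, "incorrect"),
    ("2019-05-20 07:00:00", 0, "incorrect"),
]
df = pd.DataFrame(rows, columns=["Date", "t_value", "C/IC"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date")

# Calendar day of each row (time reset to midnight), one entry per row.
day = df.index.normalize()

# For every row, the number of "incorrect" rows on that row's day.
incorrect_per_day = (df["C/IC"] == "incorrect").groupby(day).transform("sum")

# Keep rows whose day has at most 2 incorrect entries.
# to_numpy() makes the mask positional, which avoids alignment issues
# with the duplicated 2019-05-20 07:00:00 timestamp.
result = df[(incorrect_per_day <= 2).to_numpy()]
print(result)
```

Because transform returns one value per original row, the mask lines up with df directly and no intermediate list of dates has to be built.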

Answer 2

Sorry, I misunderstood the question.

Updated answer: you can find the dates to remove as follows:

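# assumes df has a DatetimeIndex; .date is available on the index directly (no .dt accessor)
df['_date'] = df.index.date
incorrect_df = df[df['C/IC'] == 'incorrect']
# count incorrect rows per date; the dates become the index of the result
incorrect_count = incorrect_df.groupby('_date')['C/IC'].count()
dates_to_remove = set(incorrect_count[incorrect_count > 2].index)
    # using set to make the later step more efficient if the df is long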

Then mask the DataFrame accordingly:

mask = [x not in dates_to_remove for x in df['_date']]
res = df[mask]
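
The two steps above can also be wrapped into one function that builds the mask with a vectorized isin instead of a list comprehension. This is a hedged sketch: drop_days_with_many_incorrect and its threshold parameter are hypothetical names introduced here, and the sample rows are typed in from the question.

```python
import pandas as pd

def drop_days_with_many_incorrect(df, threshold=2):
    """Hypothetical helper: drop every calendar day that has more than
    `threshold` rows labelled 'incorrect' in the 'C/IC' column."""
    days = df.index.date  # object array of datetime.date, one per row
    is_bad = (df["C/IC"] == "incorrect").to_numpy()
    # Number of incorrect rows per day.
    counts = pd.Series(days[is_bad]).value_counts()
    dates_to_remove = counts[counts > threshold].index
    # Positional boolean mask: True for rows on days we keep.
    keep = ~pd.Index(days).isin(dates_to_remove)
    return df[keep]

# Sample rows from the question (timestamp, t_value, C/IC).
stamps = [
    "2019-05-17 00:00:00", "2019-05-17 01:00:00", "2019-05-17 02:00:00",
    "2019-05-17 03:00:00", "2019-05-17 04:00:00", "2019-05-18 01:00:00",
    "2019-05-18 02:00:00", "2019-05-18 03:00:00", "2019-05-19 04:00:00",
    "2019-05-19 09:00:00", "2019-05-19 11:00:00", "2019-05-20 07:00:00",
    "2019-05-20 08:00:00", "2019-05-20 09:00:00", "2019-05-20 07:00:00",
]
t_value = [0, 0, 0, 4, 5, 0, 6, 7, 0, 0, 8, 2, 0, 0, 0]
labels = ["incorrect" if v == 0 else "correct" for v in t_value]
df = pd.DataFrame({"t_value": t_value, "C/IC": labels},
                  index=pd.to_datetime(stamps))

res = drop_days_with_many_incorrect(df)
print(res)
```

Unlike the snippet above, this version does not add a helper _date column to df, so the caller's DataFrame is left unchanged.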

Delete dates based on a condition in Python
