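I am trying to implement the following condition: if the count of incorrect values on a date is greater than 2 (2019-05-17 and 2019-05-20 in the example below), the complete date (all of its time slots) is removed.
Input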
t_value C/IC
2019-05-17 00:00:00 0 incorrect
2019-05-17 01:00:00 0 incorrect
2019-05-17 02:00:00 0 incorrect
2019-05-17 03:00:00 4 correct
2019-05-17 04:00:00 5 correct
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
2019-05-20 07:00:00 2 correct
2019-05-20 08:00:00 0 incorrect
2019-05-20 09:00:00 0 incorrect
2019-05-20 07:00:00 0 incorrect
Desired output
t_value C/IC
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
I am not sure which time-based operation to perform to get the desired result. Thanks.
Answer 0 (score: 1)
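import numpy as np
import pandas as pd
from io import StringIO
#read in data
#`data` is assumed to be a string holding the input table from the question
df = pd.read_csv(StringIO(data),sep=r'\s{2,}', engine='python')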
#give index a name
df.index.name = 'Date'
#convert to datetime
#and sort index
#usually safer to sort datetime index in Pandas
df.index = pd.to_datetime(df.index)
df = df.sort_index()
res = (df
#group by date and c/ic
.groupby([pd.Grouper(freq='1D',level='Date'),"C/IC"])
.size()
#get rows greater than 2 and incorrect
.loc[lambda x: x>2,"incorrect"]
#keep only the date index
.droplevel(-1)
.index
#datetime information trapped here
#and due to grouping, it is different from initial datetime
#as such, we convert to string
#and build another batch of dates
.astype(str)
.tolist()
)
res
['2019-05-17', '2019-05-20']
#build a numpy array of dates
idx = np.array(res, dtype='datetime64')
#exclude dates in idx and get final value
#aim is to get dates, irrespective of time
df.loc[~np.isin(df.index.date,idx)]
t_value C/IC
Date
2019-05-18 01:00:00 0 incorrect
2019-05-18 02:00:00 6 correct
2019-05-18 03:00:00 7 correct
2019-05-19 04:00:00 0 incorrect
2019-05-19 09:00:00 0 incorrect
2019-05-19 11:00:00 8 correct
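For comparison, the same filtering can be sketched without the round trip through strings. This is a minimal sketch, not the answerer's method: it assumes the frame has a DatetimeIndex and a `C/IC` column as in the question, and the names `rows`, `is_incorrect` and `bad_per_day` are made up for illustration. It counts the incorrect rows per calendar day with `groupby(...).transform("sum")` and keeps the days where that count is at most 2.

```python
import pandas as pd

# Sample rows copied from the question: (timestamp, t_value, C/IC).
rows = [
    ("2019-05-17 00:00:00", 0, "incorrect"),
    ("2019-05-17 01:00:00", 0, "incorrect"),
    ("2019-05-17 02:00:00", 0, "incorrect"),
    ("2019-05-17 03:00:00", 4, "correct"),
    ("2019-05-17 04:00:00", 5, "correct"),
    ("2019-05-18 01:00:00", 0, "incorrect"),
    ("2019-05-18 02:00:00", 6, "correct"),
    ("2019-05-18 03:00:00", 7, "correct"),
    ("2019-05-19 04:00:00", 0, "incorrect"),
    ("2019-05-19 09:00:00", 0, "incorrect"),
    ("2019-05-19 11:00:00", 8, "correct"),
    ("2019-05-20 07:00:00", 2, "correct"),
    ("2019-05-20 08:00:00", 0, "incorrect"),
    ("2019-05-20 09:00:00", 0, "incorrect"),
    ("2019-05-20 07:00:00", 0, "incorrect"),
]
df = pd.DataFrame(rows, columns=["Date", "t_value", "C/IC"])
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date").sort_index()

# True for every row marked incorrect.
is_incorrect = df["C/IC"].eq("incorrect")

# Number of incorrect rows on each row's calendar day, broadcast back to every row.
# normalize() sets the time part to midnight, so rows of one day share a key.
bad_per_day = is_incorrect.groupby(df.index.normalize()).transform("sum")

# Keep only the days with at most 2 incorrect rows.
# to_numpy() makes the mask positional, which sidesteps the duplicated timestamp.
res = df[(bad_per_day <= 2).to_numpy()]
print(res)
```

Because `transform` returns one value per original row, the mask lines up with `df` directly and no separate list of dates has to be built.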
Answer 1 (score: 0)
I misunderstood the question, sorry.
Updated answer: you can find the dates to remove as follows:
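# a DatetimeIndex exposes .date directly (the .dt accessor is for Series)
df['_date'] = df.index.date
incorrect_df = df[df['C/IC'] == 'incorrect']
# count the incorrect rows per date; the dates become the index of the result
incorrect_count = incorrect_df.groupby('_date')['C/IC'].count()
dates_to_remove = set(incorrect_count[incorrect_count > 2].index)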
# using set to make the later step more efficient if the df is long
Then mask the data frame accordingly:
mask = [x not in dates_to_remove for x in df['_date']]
res = df[mask]
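The two steps can also be wrapped into one reusable function. The following is a minimal sketch under a few assumptions: the frame has a DatetimeIndex and a `C/IC` column as in the question, and the function name `drop_days_with_many_incorrect`, its `max_incorrect` parameter and the small `demo` frame are hypothetical, introduced only for illustration. It avoids adding a helper column to the frame and uses `Series.isin` in place of the list comprehension.

```python
import pandas as pd


def drop_days_with_many_incorrect(df, max_incorrect=2):
    """Drop all rows of any calendar day with more than max_incorrect 'incorrect' rows."""
    # Calendar date of each row, held in a positionally indexed Series.
    dates = pd.Series(df.index.date)
    # Positional boolean mask of the rows marked incorrect.
    incorrect = (df["C/IC"] == "incorrect").to_numpy()
    # Number of incorrect rows per date.
    counts = dates[incorrect].value_counts()
    dates_to_remove = set(counts[counts > max_incorrect].index)
    # Keep the rows whose date is not flagged for removal.
    keep = ~dates.isin(dates_to_remove).to_numpy()
    return df[keep]


# Small demo frame: three incorrect rows on 2019-05-17, one on 2019-05-18.
idx = pd.to_datetime([
    "2019-05-17 00:00", "2019-05-17 01:00", "2019-05-17 02:00",
    "2019-05-17 03:00", "2019-05-18 01:00", "2019-05-18 02:00",
])
demo = pd.DataFrame(
    {
        "t_value": [0, 0, 0, 4, 0, 6],
        "C/IC": ["incorrect", "incorrect", "incorrect",
                 "correct", "incorrect", "correct"],
    },
    index=idx,
)
res = drop_days_with_many_incorrect(demo)
print(res)
```

Working with positional masks means the function does not depend on the index being unique, which matters here because the sample input repeats the timestamp 2019-05-20 07:00:00.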