Question

I have a DataFrame containing test runs, with their dates and results. It looks like this:

TestName;Date;IsPassed
test1;1/31/2017 9:44:30 PM;0
test1;1/31/2017 9:39:00 PM;0
test1;1/31/2017 9:38:29 PM;1
test1;1/31/2017 9:38:27 PM;1
test2;10/31/2016 5:05:02 AM;0
test3;12/7/2016 8:58:36 PM;0
test3;12/7/2016 8:57:19 PM;0
test3;12/7/2016 8:56:15 PM;0
test4;12/5/2016 6:50:49 PM;0
test4;12/5/2016 6:49:50 PM;0
test4;12/5/2016 3:23:09 AM;1
test4;12/4/2016 11:51:29 PM;1

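I would like to find the names of tests that were not run before, or not run after, a specified date.

Of course, I could do it like this:

Determine all the unique test names
For each of them, work out its earliest and latest date
Based on those, add the corresponding rows to a new DataFrame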
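The manual approach described above might look roughly like the following sketch. The helper name runs_of_tests_not_run_after and the small stand-in DataFrame are hypothetical, used only for illustration, and only the "not run after a date" variant is shown.

```python
import pandas as pd

# Small stand-in for the data in the question (a subset, for illustration).
test_runs = pd.DataFrame({
    "TestName": ["test1", "test1", "test2", "test3", "test3"],
    "Date": pd.to_datetime([
        "2017-01-31 21:44:30", "2017-01-31 21:38:27",
        "2016-10-31 05:05:02",
        "2016-12-07 20:58:36", "2016-12-07 20:56:15",
    ]),
    "IsPassed": [0, 1, 0, 0, 0],
})


def runs_of_tests_not_run_after(df, cutoff):
    """Hypothetical helper: rows of tests whose latest run is before cutoff."""
    cutoff = pd.Timestamp(cutoff)
    kept = []
    # Step 1: every unique test name.
    for name in df["TestName"].unique():
        runs = df[df["TestName"] == name]
        # Step 2: earliest and latest date of this test.
        # (first would be the one to check for the "not run before" variant.)
        first, last = runs["Date"].min(), runs["Date"].max()
        # Step 3: collect the rows of tests that satisfy the condition.
        if last < cutoff:
            kept.append(runs)
    # Return an empty frame with the same columns when nothing matches.
    return pd.concat(kept) if kept else df.iloc[0:0]


old_tests = runs_of_tests_not_run_after(test_runs, "2017-01-01")
print(old_tests)
```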

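But is there a way to do this natively in Pandas, without explicit loops?

Update

Building on @jezrael's solution, let's say I want to keep only the runs of tests that ran exclusively in 2016. Do I then have to do it like this?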

idx = test_runs.groupby('TestName').Date.agg(['idxmax']).stack().unique()
selected = test_runs.loc[idx].Date < pd.to_datetime('2017-01-01')
tests = test_runs.loc[idx].loc[selected].TestName
print(test_runs[test_runs.TestName.isin(tests)])

Output:

   TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1

Answer 1

I think you need groupby with agg, using idxmin and idxmax to return the index values of the rows holding the min and max dates of each group, then stack to reshape the result to a Series. It is also necessary to remove duplicates with unique, for groups that have only one row, like test2.

import pandas as pd

df.Date = pd.to_datetime(df.Date)
idx = df.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
print (idx)
[ 3  0  4  7  5 11  8]

Finally, select all the rows with loc:

selected = df.loc[idx]
print (selected)
   TestName                Date  IsPassed
3     test1 2017-01-31 21:38:27         1
0     test1 2017-01-31 21:44:30         0
4     test2 2016-10-31 05:05:02         0
7     test3 2016-12-07 20:56:15         0
5     test3 2016-12-07 20:58:36         0
11    test4 2016-12-04 23:51:29         1
8     test4 2016-12-05 18:50:49         0

If you need sorted output, add numpy.sort, because the output of unique is a numpy array:

import numpy as np

print (df.loc[np.sort(idx)])
   TestName                Date  IsPassed
0     test1 2017-01-31 21:44:30         0
3     test1 2017-01-31 21:38:27         1
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
11    test4 2016-12-04 23:51:29         1

Edit:

Your code looks good; I only added some improvements:

idx = test_runs.groupby('TestName').Date.agg(['idxmin','idxmax']).stack().unique()
#store the output in a variable, so there is no need to select twice
df1 = test_runs.loc[idx]
#casting to datetime is not necessary
selected = df1['Date'] < '2017-01-01'
#to select rows and a column at once, use df.loc[index_val, column_name]
tests = df1.loc[selected, 'TestName']
#unique was added for better performance on a large df
print(test_runs[test_runs.TestName.isin(tests.unique())])
   TestName                Date  IsPassed
4     test2 2016-10-31 05:05:02         0
5     test3 2016-12-07 20:58:36         0
6     test3 2016-12-07 20:57:19         0
7     test3 2016-12-07 20:56:15         0
8     test4 2016-12-05 18:50:49         0
9     test4 2016-12-05 18:49:50         0
10    test4 2016-12-05 03:23:09         1
11    test4 2016-12-04 23:51:29         1
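As a possible alternative for the "ran only in 2016" filter, and not something taken from the answer itself, groupby with transform('max') broadcasts each test's latest run date onto all of its rows, so a single boolean mask does the selection. For this check only the latest run of each test matters. The sketch below uses a small stand-in DataFrame; the names last_run and only_2016 are illustrative.

```python
import pandas as pd

# Small stand-in for the question's data (a subset, for illustration).
test_runs = pd.DataFrame({
    "TestName": ["test1", "test1", "test2", "test4", "test4"],
    "Date": pd.to_datetime([
        "2017-01-31 21:44:30", "2017-01-31 21:38:27",
        "2016-10-31 05:05:02",
        "2016-12-05 18:50:49", "2016-12-04 23:51:29",
    ]),
    "IsPassed": [0, 1, 0, 0, 1],
})

# Latest run date per test, repeated on every row of that test.
last_run = test_runs.groupby("TestName")["Date"].transform("max")

# Keep all rows of tests whose latest run happened before 2017.
only_2016 = test_runs[last_run < pd.Timestamp("2017-01-01")]
print(only_2016)
```

Swapping 'max' for 'min' and reversing the comparison would give the "not run before a date" variant in the same way.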
