丢弃缺少数据的组

时间:2017-08-04 17:19:17

标签: python pandas

(编辑:根据要求更新了更多数据)

我有一个包含几千行的pandas数据框,如下所示:

>>> y=x.groupby('wbdqueue_id')
>>> y.head()
           id                        jname  orderid  wbdqueue_id  platform_id  \
59    1341127    ondemand_build_baspen-w7g       15        26581    1341122.0
60    1341126       ondemand_qa_qforchecka       41        26581    1341125.0
61    1341125       ondemand_build_bchecka       17        26581    1341123.0
63    1341123    ondemand_build_baspen-w7f       14        26581    1341122.0
64    1341122   ondemand_update_waspen-w7a        2        26581    1341073.0
116   1340927              qa_db_insertall       56        26578          NaN
117   1340926      ondemand_qa_qca-pc20rha       39        26578    1340925.0
118   1340925   ondemand_build_bca-pc20rha       16        26578    1340924.0
119   1340924  ondemand_update_wca-pc20rha        3        26578    1340871.0
120   1340923    ondemand_qa_qaspen-w7_qa2       35        26578    1340922.0
173   1340870              qa_db_insertall       56        26577          NaN
174   1340869           ondemand_qa_qtopia       52        26577    1340868.0
175   1340868        ondemand_build_btopia       33        26577    1340867.0
176   1340867       ondemand_update_wtopia        9        26577    1340814.0
177   1340866            ondemand_qa_qmoed       47        26577    1340865.0
230   1340813              qa_db_insertall       56        26576          NaN
231   1340812            ondemand_qa_qmoeb       46        26576    1340811.0
232   1340811         ondemand_build_bmoed       22        26576    1340810.0
233   1340810         ondemand_build_bmoee       23        26576    1340809.0
234   1340809        ondemand_update_wmoeb        5        26576    1340757.0
287   1340756              qa_db_insertall       56        26575          NaN
293   1340750      ondemand_qa_qca-pc20rha       39        26575    1340749.0
294   1340749   ondemand_build_bca-pc20rha       16        26575    1340748.0
295   1340748  ondemand_update_wca-pc20rha        3        26575    1340700.0
296   1340747      ondemand_qa_qvmwin7-64i       55        26575    1340746.0
344   1340699              qa_db_insertall       56        26574          NaN
345   1340698          ondemand_qa_qslotha       51        26574    1340697.0
346   1340697       ondemand_build_bmousef       28        26574    1340684.0
347   1340696           ondemand_qa_qboarb       38        26574    1340695.0
348   1340695       ondemand_build_bmouseg       29        26574    1340684.0
...       ...                          ...      ...          ...          ...
9659  1327031              qa_db_insertall       56        26311          NaN
9660  1327030       ondemand_qa_qforchecka       41        26311    1327029.0
9661  1327029       ondemand_build_bchecka       17        26311    1327027.0
9662  1327028    ondemand_qa_qaspen-w7_qa1       34        26311    1327027.0
9663  1327027    ondemand_build_baspen-w7f       14        26311    1327024.0
9716  1326974              qa_db_insertall       56        26310          NaN
9717  1326973            ondemand_qa_qmoeb       46        26310    1326972.0
9718  1326972         ondemand_build_bmoed       22        26310    1326971.0
9719  1326971         ondemand_build_bmoee       23        26310    1326970.0
9720  1326970        ondemand_update_wmoeb        5        26310    1326918.0
9773  1326917              qa_db_insertall       56        26309          NaN
9774  1326916           ondemand_qa_qtopia       52        26309    1326915.0
9775  1326915        ondemand_build_btopia       33        26309    1326914.0
9776  1326914       ondemand_update_wtopia        9        26309    1326861.0
9777  1326913    ondemand_qa_qaspen-w7_qa2       35        26309    1326912.0
9830  1326860              qa_db_insertall       56        26308          NaN
9831  1326859    ondemand_build_balder-w7d       12        26308    1326852.0
9832  1326858      ondemand_qa_qvmwin7-64i       55        26308    1326857.0
9833  1326857    ondemand_build_balder-w7h       13        26308    1326852.0
9834  1326856      ondemand_qa_qvmwin7-64d       54        26308    1326855.0
9887  1326803              qa_db_insertall       56        26307          NaN
9888  1326802    ondemand_qa_qaspen-w7_qa1       34        26307    1326799.0
9889  1326801       ondemand_qa_qforchecka       41        26307    1326800.0
9890  1326800       ondemand_build_bchecka       17        26307    1326799.0
9891  1326799    ondemand_build_baspen-w7f       14        26307    1326796.0
9944  1326746              qa_db_insertall       56        26306          NaN
9950  1326740         ondemand_qa_qkrakena       43        26306    1326737.0
9951  1326739          ondemand_qa_qkirina       42        26306    1326738.0
9952  1326738       ondemand_build_bkirina       18        26306    1326737.0
9953  1326737      ondemand_build_bkrakena       19        26306    1326736.0

           startdatetime         enddatetime  runtime
59   2017-07-31 23:14:56 2017-07-31 23:19:12 00:04:16
60   2017-07-31 23:15:35 2017-07-31 23:34:12 00:18:37
61   2017-07-31 23:14:56 2017-07-31 23:15:30 00:00:34
63   2017-07-31 23:10:05 2017-07-31 23:14:56 00:04:51
64   2017-07-31 23:09:32 2017-07-31 23:10:00 00:00:28
116  2017-07-31 21:42:28 2017-07-31 21:42:55 00:00:27
117  2017-07-31 21:10:15 2017-07-31 21:17:46 00:07:31
118  2017-07-31 21:09:37 2017-07-31 21:10:10 00:00:33
119  2017-07-31 21:09:22 2017-07-31 21:09:32 00:00:10
120  2017-07-31 21:17:57 2017-07-31 21:33:22 00:15:25
173  2017-07-31 21:05:15 2017-07-31 21:05:17 00:00:02
174  2017-07-31 20:27:07 2017-07-31 20:27:19 00:00:12
175  2017-07-31 20:27:00 2017-07-31 20:27:02 00:00:02
176  2017-07-31 20:26:52 2017-07-31 20:26:54 00:00:02
177  2017-07-31 20:48:56 2017-07-31 20:48:59 00:00:03
230  2017-07-31 21:04:50 2017-07-31 21:05:15 00:00:25
231  2017-07-31 20:20:02 2017-07-31 20:23:24 00:03:22
232  2017-07-31 20:19:43 2017-07-31 20:19:57 00:00:14
233  2017-07-31 20:19:11 2017-07-31 20:19:38 00:00:27
234  2017-07-31 20:17:26 2017-07-31 20:18:49 00:01:23
287  2017-07-31 20:58:36 2017-07-31 20:59:00 00:00:24
293  2017-07-31 20:08:57 2017-07-31 20:16:33 00:07:36
294  2017-07-31 20:07:44 2017-07-31 20:08:52 00:01:08
295  2017-07-31 20:07:06 2017-07-31 20:07:39 00:00:33
296  2017-07-31 20:36:23 2017-07-31 20:36:58 00:00:35
344  2017-07-31 20:38:11 2017-07-31 20:38:35 00:00:24
345  2017-07-31 19:58:32 2017-07-31 20:00:04 00:01:32
346  2017-07-31 19:57:48 2017-07-31 19:58:26 00:00:38
347  2017-07-31 19:59:17 2017-07-31 20:04:59 00:05:42
348  2017-07-31 19:58:26 2017-07-31 19:59:12 00:00:46
...                  ...                 ...      ...
9659 2017-07-18 18:44:05 2017-07-18 18:44:29 00:00:24
9660 2017-07-18 18:08:17 2017-07-18 18:25:53 00:17:36
9661 2017-07-18 18:07:35 2017-07-18 18:08:12 00:00:37
9662 2017-07-18 18:07:40 2017-07-18 18:15:01 00:07:21
9663 2017-07-18 18:03:20 2017-07-18 18:07:35 00:04:15
9716 2017-07-18 18:10:53 2017-07-18 18:11:15 00:00:22
9717 2017-07-18 17:40:52 2017-07-18 17:44:15 00:03:23
9718 2017-07-18 17:40:29 2017-07-18 17:40:48 00:00:19
9719 2017-07-18 17:39:51 2017-07-18 17:40:25 00:00:34
9720 2017-07-18 17:37:42 2017-07-18 17:39:23 00:01:41
9773 2017-07-18 16:45:15 2017-07-18 16:45:39 00:00:24
9774 2017-07-18 16:02:44 2017-07-18 16:06:24 00:03:40
9775 2017-07-18 16:02:09 2017-07-18 16:02:39 00:00:30
9776 2017-07-18 16:01:55 2017-07-18 16:02:04 00:00:09
9777 2017-07-18 16:10:46 2017-07-18 16:26:18 00:15:32
9830 2017-07-18 16:10:22 2017-07-18 16:10:46 00:00:24
9831 2017-07-18 15:41:47 2017-07-18 15:44:07 00:02:20
9832 2017-07-18 16:01:19 2017-07-18 16:01:55 00:00:36
9833 2017-07-18 15:44:07 2017-07-18 15:46:02 00:01:55
9834 2017-07-18 15:41:52 2017-07-18 15:42:36 00:00:44
9887 2017-07-18 15:02:56 2017-07-18 15:03:20 00:00:24
9888 2017-07-18 14:34:26 2017-07-18 14:41:15 00:06:49
9889 2017-07-18 14:35:03 2017-07-18 14:51:17 00:16:14
9890 2017-07-18 14:34:21 2017-07-18 14:34:58 00:00:37
9891 2017-07-18 14:30:03 2017-07-18 14:34:22 00:04:19
9944 2017-07-18 13:50:01 2017-07-18 13:50:25 00:00:24
9950 2017-07-18 13:18:38 2017-07-18 13:39:37 00:20:59
9951 2017-07-18 13:19:35 2017-07-18 13:24:42 00:05:07
9952 2017-07-18 13:18:38 2017-07-18 13:19:30 00:00:52
9953 2017-07-18 13:17:16 2017-07-18 13:18:33 00:01:17

我想为每个wbdqueue_id获取ondemand_build_baspen-w7g的ondemand_update_waspen-w7a和 enddatetime startdatetime 之间的区别。在Stackoverflow的帮助下,我学会了如何处理groupby和apply(myfunc)或lambda。 以下是提出的解决方案,其中任何一个都可以很好地工作,如果不是缺少数据:

df.set_index('jname').groupby('wbdqueue_id').apply(
    lambda x: x.at['ondemand_update_waspen-w7a', 'startdatetime'] \
            - x.at['ondemand_build_baspen-w7f', 'enddatetime'] )

def get_time_diff(dff):
    start_time = dff[dff.jname.eq('ondemand_update_waspen-w7a')].startdatetime.values[0]
    end_time = dff[dff.jname.eq('ondemand_build_baspen-w7g')].enddatetime.values[0]
    return pd.Timedelta(end_time - start_time)

但似乎并非所有组都有ondemand_update_waspen-w7a和/或ondemand_build_baspen-w7g。这会导致上述功能失败。 如何跳过或删除没有我需要的数据的组?似乎没有一种简单的方法可以做到这一点。

1 个答案:

答案 0 :(得分:1)

通过添加try-except来处理缺少指定字符串的组来调整函数:

def get_time_diff(dff):
    try:
        start_time = dff[dff.jname.eq('ondemand_update_waspen-w7a')].startdatetime.values[0]
        end_time = dff[dff.jname.eq('ondemand_build_baspen-w7g')].enddatetime.values[0]
        return pd.Timedelta(end_time - start_time)
    except (KeyError, IndexError):
        return 0

编辑:

出于调试目的,我在执行索引的函数部分周围放置了try-except语句。如果您正在使用的字符串丢失,则每个人都会报告。我还编辑了函数参数,以确保您可以选择切换字符串。这将允许您在函数中输入任何随机字符串以进行测试:

def get_time_diff2(dff, str1='ondemand_update_waspen-w7a', str2='ondemand_build_baspen-w7g'):
    missing = ""
    try:
        start_time = dff[dff.jname.eq(str1)].startdatetime.values[0]
    except (KeyError, IndexError):
        missing += "{} is not in startdatetime. ".format(str1)
    try:
        end_time = dff[dff.jname.eq(str2)].enddatetime.values[0]
    except (KeyError, IndexError):
        missing += "{} is not in enddatetime".format(str2)
    if missing:
        return missing
    return pd.Timedelta(end_time - start_time)

如果您对数据框和两个字符串使用此函数,例如" hello"和" world",它会告诉您是否在您应用get_time_diff功能的特定组中缺少这两个或两个。请参阅以下内容:

df.groupby('wbdqueue_id').apply(get_time_diff2, 'hello', 'world')

应该返回:

# wbdqueue_id
# 26578    hello is not in startdatetime. world is not in enddatetime
# 26581    hello is not in startdatetime. world is not in enddatetime
# dtype: object

如果只缺少其中一个:

df.groupby('wbdqueue_id').apply(get_time_diff2, 'ondemand_update_waspen-w7a', 'hello')

应该返回:

wbdqueue_id
26578    hello is not in enddatetime
26581    hello is not in enddatetime
dtype: object

我希望这会有所帮助。