pandas DataFrame - 获取持续时间的小时数

时间:2013-09-08 01:38:05

标签: python pandas

我是熊猫的相对新手,我不知道如何处理这个问题。我正在通过Help Desk系统分析故障单流。原始数据看起来像这样(有更多列,有时跨越几天):

    TicketNo SvcGroup           CreatedAt                   ClosedAt
0    4237941     Unix 2013-07-28 03:55:00 2013-07-28 11:01:37.346438
1    4238041  Windows 2013-07-28 04:59:00 2013-07-28 18:25:02.193182
2    4238051  Windows 2013-07-28 05:09:00 2013-07-28 23:11:12.003673
3    4238291  Windows 2013-07-28 05:10:00 2013-07-28 05:32:51.547251
4    4238321     Unix 2013-07-28 01:15:00        2013-07-28 10:09:20
5    4238331     Unix 2013-07-28 01:53:00 2013-07-28 17:42:56.192088
6    4238561  Windows 2013-07-28 02:03:00 2013-07-28 06:34:09.455042
7    4238691  Windows 2013-07-28 02:03:00 2013-07-28 20:54:47.306731
8    4238811  Windows 2013-07-28 03:23:00 2013-07-28 13:15:20.823505
9    4238851  Windows 2013-07-28 04:16:00 2013-07-28 23:51:55.561463
10   4239011     Unix 2013-07-28 04:26:00 2013-07-28 09:27:06.275342
11   4239041  Windows 2013-07-28 04:38:00 2013-07-28 07:55:34.416621
12   4239131     Unix 2013-07-28 08:15:00 2013-07-28 08:46:42.380739
13   4239141  Windows 2013-07-28 01:08:00 2013-07-28 15:37:12.266341

我希望按小时查看数据,看看票据如何通过转移流经帮助台 - 所以中间步骤可能是这样的:

                        Opened  Open  Closed  CarryFwd
TicketNo SvcGroup Hour
4237941  Unix     3          1     1       0         1
                  4          0     1       0         1
                  5          0     1       0         1
                  6          0     1       0         1
                  7          0     1       0         1
                  8          0     1       0         1
                  9          0     1       0         1
                  10         0     1       0         1
                  11         0     1       1         0
4239041  Windows  4          1     1       0         1
                  5          0     1       0         1
                  6          0     1       0         1
                  7          0     1       1         0

最终结果如(从上面分组):

               Opened  Closed  CarryFwd
SvcGroup Hour
Unix     3          6       7        47
         4          7      10        44
         5          1       6        39
         6         11       2        48
         7          7       3        52
         8          5       5        52
         9          5      11        46
Windows  3          6       7        22
         4          3      10        15
         5          5       2        18
         6          6       2        22
         7         11      11        22
         8          2       4        20
         9          0       2        18   

注意:这是按小时分解的,但我可以按天,周等等来查看。一旦我了解上述内容,我就可以判断一个服务组是否正在取得进展,落后等等

有关如何处理此问题的任何想法?我真正无法弄清楚的部分是如何将CreatedAt带到ClosedAt持续时间并按离散时间间隔(小时等)将其分解......

非常感谢任何指导。感谢。

2 个答案:

答案 0 :(得分:0)

这只是部分答案。

读入您的数据,注意必须合并2个日期/时间列

In [75]: df = read_csv(StringIO(data),sep='\s+',skiprows=1,parse_dates=[[3,4],[5,6]],header=None)

In [76]: df.columns = ['created','closed','idx','num','typ']

In [77]: df
Out[77]: 
               created                     closed  idx      num      typ
0  2013-07-28 03:55:00 2013-07-28 11:01:37.346438    0  4237941     Unix
1  2013-07-28 04:59:00 2013-07-28 18:25:02.193182    1  4238041  Windows
2  2013-07-28 05:09:00 2013-07-28 23:11:12.003673    2  4238051  Windows
3  2013-07-28 05:10:00 2013-07-28 05:32:51.547251    3  4238291  Windows
4  2013-07-28 01:15:00        2013-07-28 10:09:20    4  4238321     Unix
5  2013-07-28 01:53:00 2013-07-28 17:42:56.192088    5  4238331     Unix
6  2013-07-28 02:03:00 2013-07-28 06:34:09.455042    6  4238561  Windows
7  2013-07-28 02:03:00 2013-07-28 20:54:47.306731    7  4238691  Windows
8  2013-07-28 03:23:00 2013-07-28 13:15:20.823505    8  4238811  Windows
9  2013-07-28 04:16:00 2013-07-28 23:51:55.561463    9  4238851  Windows
10 2013-07-28 04:26:00 2013-07-28 09:27:06.275342   10  4239011     Unix
11 2013-07-28 04:38:00 2013-07-28 07:55:34.416621   11  4239041  Windows
12 2013-07-28 08:15:00 2013-07-28 08:46:42.380739   12  4239131     Unix
13 2013-07-28 01:08:00 2013-07-28 15:37:12.266341   13  4239141  Windows

In [78]: df.dtypes
Out[78]: 
created    datetime64[ns]
closed     datetime64[ns]
idx                 int64
num                 int64
typ                object
dtype: object

对于每个偶数,将1放在范围内(创建 - 关闭)。用0表示填充nan。

In [82]: m = df.apply(lambda x: Series(1,index=np.arange(x['created'].hour,x['closed'].hour+1)),axis=1).fillna(0)

In [81]: m
Out[81]: 
    1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
0    0   0   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0
1    0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0
2    0   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
3    0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4    1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
5    1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0
6    0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
7    0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0
8    0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
9    0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
10   0   0   0   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
11   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
12   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
13   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0

将其加入原始数据集并设置索引

在[83]中:y = df [['num','typ']]。join(m).set_index(['num','typ'])

In [84]: y
Out[84]: 
                 1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20  21  22  23
num     typ                                                                                                
4237941 Unix      0   0   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0
4238041 Windows   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0
4238051 Windows   0   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
4238291 Windows   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4238321 Unix      1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0
4238331 Unix      1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0
4238561 Windows   0   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4238691 Windows   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0
4238811 Windows   0   0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0
4238851 Windows   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
4239011 Unix      0   0   0   1   1   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4239041 Windows   0   0   0   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4239131 Unix      0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
4239141 Windows   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0   0   0   0   0

此时你可以进行计算

打开/关闭是简单的边缘检测。 Carry Fwd只是m.where(m==1)

答案 1 :(得分:0)

这是另一种方式......

创建一个占用一行的函数并创建以下相应的DataFrame:

def sparse_opened_closed(row):
    opened_hour, closed_hour = row['CreatedAt'].hour, row['ClosedAt'].hour
    hours = xrange(opened_hour, closed_hour + 1)
    index = pd.MultiIndex.from_tuples((row['TicketNo'], row['SvcGroup'], h) for h in hours])
    opened, closed = np.zeros_like(hours), np.zeros_like(hours)
    opened[0], closed[-1] = 1, 1
    open, carry = np.ones_like(hours), np.ones_like(hours)
    carry[-1] = 0
    return pd.DataFrame({'Opened': opened, 'Open': open, 'Closed': closed, 'CarryFwd': carry}, index=index)

你当然可以提高效率。

现在,遍历每一行并连续:

In [11]: pd.concat(sparse_opened_closed(row) for _, row in df.iterrows()).head(10)
Out[11]:
                    CarryFwd  Closed  Open  Opened
4237941 Unix    3          1       0     1       1
                4          1       0     1       0
                5          1       0     1       0
                6          1       0     1       0
                7          1       0     1       0
                8          1       0     1       0
                9          1       0     1       0
                10         1       0     1       0
                11         0       1     1       0
4238041 Windows 4          1       0     1       1