I'm a relative newcomer to pandas and I'm not sure how to approach this problem. I'm analyzing ticket flow through a Help Desk system. The raw data looks like this (there are more columns, and tickets sometimes span several days):
TicketNo SvcGroup CreatedAt ClosedAt
0 4237941 Unix 2013-07-28 03:55:00 2013-07-28 11:01:37.346438
1 4238041 Windows 2013-07-28 04:59:00 2013-07-28 18:25:02.193182
2 4238051 Windows 2013-07-28 05:09:00 2013-07-28 23:11:12.003673
3 4238291 Windows 2013-07-28 05:10:00 2013-07-28 05:32:51.547251
4 4238321 Unix 2013-07-28 01:15:00 2013-07-28 10:09:20
5 4238331 Unix 2013-07-28 01:53:00 2013-07-28 17:42:56.192088
6 4238561 Windows 2013-07-28 02:03:00 2013-07-28 06:34:09.455042
7 4238691 Windows 2013-07-28 02:03:00 2013-07-28 20:54:47.306731
8 4238811 Windows 2013-07-28 03:23:00 2013-07-28 13:15:20.823505
9 4238851 Windows 2013-07-28 04:16:00 2013-07-28 23:51:55.561463
10 4239011 Unix 2013-07-28 04:26:00 2013-07-28 09:27:06.275342
11 4239041 Windows 2013-07-28 04:38:00 2013-07-28 07:55:34.416621
12 4239131 Unix 2013-07-28 08:15:00 2013-07-28 08:46:42.380739
13 4239141 Windows 2013-07-28 01:08:00 2013-07-28 15:37:12.266341
I want to look at the data by hour and see how tickets flow through the help desk as they are handed off - so an intermediate step might look like this:
Opened Open Closed CarryFwd
TicketNo SvcGroup Hour
4237941 Unix 3 1 1 0 1
4 0 1 0 1
5 0 1 0 1
6 0 1 0 1
7 0 1 0 1
8 0 1 0 1
9 0 1 0 1
10 0 1 0 1
11 0 1 1 0
4239041 Windows 4 1 1 0 1
5 0 1 0 1
6 0 1 0 1
7 0 1 1 0
And the final result would be something like this (grouped up from the above):
Opened Closed CarryFwd
SvcGroup Hour
Unix 3 6 7 47
4 7 10 44
5 1 6 39
6 11 2 48
7 7 3 52
8 5 5 52
9 5 11 46
Windows 3 6 7 22
4 3 10 15
5 5 2 18
6 6 2 22
7 11 11 22
8 2 4 20
9 0 2 18
Note: this is broken down by hour, but I could just as well look at it by day, week, etc. Once I have the above I can tell whether a service group is keeping up, falling behind, and so on.
Any ideas on how to approach this? The part I really can't figure out is how to take the CreatedAt-to-ClosedAt duration and break it out into discrete time intervals (hours, etc.)...
Any guidance is much appreciated. Thanks.
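To make the core trick concrete: a single CreatedAt/ClosedAt pair can be expanded into the hourly buckets it touches with `pd.date_range` (a minimal sketch, not from the original post; the timestamps are taken from the first row of the sample):

```python
import pandas as pd

created = pd.Timestamp('2013-07-28 03:55:00')
closed = pd.Timestamp('2013-07-28 11:01:37')

# Every hourly bucket the ticket touches, from its opening hour through
# its closing hour (floor() snaps each timestamp down to the hour).
hours = pd.date_range(created.floor('h'), closed.floor('h'), freq='h')
# -> hours 3 through 11 inclusive: the ticket is "open" in 9 hourly buckets
```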
Answer 0 (score: 0)
This is only a partial answer.
Read in your data, noting that the two pairs of date/time columns have to be merged:
In [74]: import numpy as np; import pandas as pd; from io import StringIO
In [75]: df = pd.read_csv(StringIO(data), sep='\s+', skiprows=1, parse_dates=[[3,4],[5,6]], header=None)
In [76]: df.columns = ['created','closed','idx','num','typ']
In [77]: df
Out[77]:
created closed idx num typ
0 2013-07-28 03:55:00 2013-07-28 11:01:37.346438 0 4237941 Unix
1 2013-07-28 04:59:00 2013-07-28 18:25:02.193182 1 4238041 Windows
2 2013-07-28 05:09:00 2013-07-28 23:11:12.003673 2 4238051 Windows
3 2013-07-28 05:10:00 2013-07-28 05:32:51.547251 3 4238291 Windows
4 2013-07-28 01:15:00 2013-07-28 10:09:20 4 4238321 Unix
5 2013-07-28 01:53:00 2013-07-28 17:42:56.192088 5 4238331 Unix
6 2013-07-28 02:03:00 2013-07-28 06:34:09.455042 6 4238561 Windows
7 2013-07-28 02:03:00 2013-07-28 20:54:47.306731 7 4238691 Windows
8 2013-07-28 03:23:00 2013-07-28 13:15:20.823505 8 4238811 Windows
9 2013-07-28 04:16:00 2013-07-28 23:51:55.561463 9 4238851 Windows
10 2013-07-28 04:26:00 2013-07-28 09:27:06.275342 10 4239011 Unix
11 2013-07-28 04:38:00 2013-07-28 07:55:34.416621 11 4239041 Windows
12 2013-07-28 08:15:00 2013-07-28 08:46:42.380739 12 4239131 Unix
13 2013-07-28 01:08:00 2013-07-28 15:37:12.266341 13 4239141 Windows
In [78]: df.dtypes
Out[78]:
created datetime64[ns]
closed datetime64[ns]
idx int64
num int64
typ object
dtype: object
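A note for newer pandas: the nested-list form of `parse_dates` used above (merging column pairs at read time) is deprecated in recent releases, so on a modern install the merge might instead be done after reading (a sketch assuming the same whitespace-separated layout; `data` here holds two rows of the sample):

```python
from io import StringIO
import pandas as pd

data = """\
0 4237941 Unix 2013-07-28 03:55:00 2013-07-28 11:01:37.346438
5 4238331 Unix 2013-07-28 01:53:00 2013-07-28 17:42:56.192088
"""

df = pd.read_csv(StringIO(data), sep=r'\s+', header=None,
                 names=['idx', 'num', 'typ', 'cdate', 'ctime', 'odate', 'otime'])
# Stitch the date and time halves back together, then parse them.
df['created'] = pd.to_datetime(df['cdate'] + ' ' + df['ctime'])
df['closed'] = pd.to_datetime(df['odate'] + ' ' + df['otime'])
df = df[['created', 'closed', 'idx', 'num', 'typ']]
```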
For each event, put a 1 in every hour of the (created, closed) range, then fill the NaNs with 0:
In [82]: m = df.apply(lambda x: pd.Series(1, index=np.arange(x['created'].hour, x['closed'].hour + 1)), axis=1).fillna(0)
In [81]: m
Out[81]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
6 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
8 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
9 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
10 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
Join this back to the original data set and set the index:
In [83]: y = df[['num','typ']].join(m).set_index(['num','typ'])
In [84]: y
Out[84]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
num typ
4237941 Unix 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
4238041 Windows 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0
4238051 Windows 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4238291 Windows 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4238321 Unix 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
4238331 Unix 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
4238561 Windows 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4238691 Windows 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
4238811 Windows 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
4238851 Windows 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
4239011 Unix 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4239041 Windows 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4239131 Unix 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4239141 Windows 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
At this point you can do the calculations.
Opened/Closed is simple edge detection, and CarryFwd is just m.where(m == 1).
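The edge detection might be sketched like this, assuming a 0/1 activity matrix shaped like `y` above (tickets as rows, hours as columns):

```python
import pandas as pd

# A tiny 0/1 activity matrix shaped like y: rows are tickets, columns are hours.
y = pd.DataFrame([[0, 1, 1, 1, 0],
                  [0, 0, 1, 0, 0]], columns=[1, 2, 3, 4, 5])

prev = y.shift(1, axis=1).fillna(0)    # was the ticket active the hour before?
nxt = y.shift(-1, axis=1).fillna(0)    # is it still active the next hour?

opened = ((y == 1) & (prev == 0)).astype(int)   # rising edge: 0 -> 1
closed = ((y == 1) & (nxt == 0)).astype(int)    # falling edge: 1 -> 0
carry = ((y == 1) & (nxt == 1)).astype(int)     # open and still open next hour
```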
Answer 1 (score: 0)
Here's another way...
Create a function which takes a row and builds the corresponding DataFrame:
def sparse_opened_closed(row):
    # Hours (inclusive) during which the ticket was active.
    opened_hour, closed_hour = row['CreatedAt'].hour, row['ClosedAt'].hour
    hours = list(range(opened_hour, closed_hour + 1))
    index = pd.MultiIndex.from_tuples(
        [(row['TicketNo'], row['SvcGroup'], h) for h in hours])
    opened, closed = np.zeros_like(hours), np.zeros_like(hours)
    opened[0], closed[-1] = 1, 1   # first hour opens the ticket, last closes it
    is_open, carry = np.ones_like(hours), np.ones_like(hours)
    carry[-1] = 0                  # nothing carries forward past the close
    return pd.DataFrame({'Opened': opened, 'Open': is_open,
                         'Closed': closed, 'CarryFwd': carry}, index=index)
You could certainly make this more efficient.
Now, iterate over each row and concatenate the results:
In [11]: pd.concat(sparse_opened_closed(row) for _, row in df.iterrows()).head(10)
Out[11]:
CarryFwd Closed Open Opened
4237941 Unix 3 1 0 1 1
4 1 0 1 0
5 1 0 1 0
6 1 0 1 0
7 1 0 1 0
8 1 0 1 0
9 1 0 1 0
10 1 0 1 0
11 0 1 1 0
4238041 Windows 4 1 0 1 1
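To get from this per-ticket/per-hour table to the final grouped result in the question, one could sum over the SvcGroup and Hour levels (a sketch, reusing the function above with named index levels so that `groupby(level=...)` can refer to them; the two-ticket `df` is illustrative):

```python
import numpy as np
import pandas as pd

def sparse_opened_closed(row):
    opened_hour, closed_hour = row['CreatedAt'].hour, row['ClosedAt'].hour
    hours = list(range(opened_hour, closed_hour + 1))
    index = pd.MultiIndex.from_tuples(
        [(row['TicketNo'], row['SvcGroup'], h) for h in hours],
        names=['TicketNo', 'SvcGroup', 'Hour'])
    opened, closed = np.zeros_like(hours), np.zeros_like(hours)
    opened[0], closed[-1] = 1, 1
    is_open, carry = np.ones_like(hours), np.ones_like(hours)
    carry[-1] = 0
    return pd.DataFrame({'Opened': opened, 'Open': is_open,
                         'Closed': closed, 'CarryFwd': carry}, index=index)

df = pd.DataFrame({
    'TicketNo': [4237941, 4239041],
    'SvcGroup': ['Unix', 'Windows'],
    'CreatedAt': pd.to_datetime(['2013-07-28 03:55:00', '2013-07-28 04:38:00']),
    'ClosedAt': pd.to_datetime(['2013-07-28 11:01:37', '2013-07-28 07:55:34']),
})

ticket_hours = pd.concat(sparse_opened_closed(row) for _, row in df.iterrows())
# Drop the per-ticket level and sum, giving one row per (SvcGroup, Hour).
final = ticket_hours.groupby(level=['SvcGroup', 'Hour']).sum()
```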