Question

I have grouped a DataFrame by the host and operation columns:

df
Out[163]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100 entries, 10069 to 1003
Data columns (total 8 columns):
args             100  non-null values
host             100  non-null values
kwargs           100  non-null values
log_timestamp    100  non-null values
operation        100  non-null values
thingy             100  non-null values
status           100  non-null values
time             100  non-null values
dtypes: float64(1), int64(2), object(5)


g = df.groupby(['host','operation'])

g
Out[165]: <pandas.core.groupby.DataFrameGroupBy object at 0x7f46ec731890>

g.groups.keys()[:10]
Out[166]:
[('yy39.segm1.org', 'gtfull'),
 ('yy39.segm1.org', 'updateWidg'),
 ('yy36.segm1.org', 'notifyTestsDelivered'),
 ('yy32.segm1.org', 'notifyTestsDelivered'),
 ('yy20.segm1.org', 'gSettings'),
 ('yy32.segm1.org', 'x_gWidgboxParams'),
 ('yy39.segm1.org', 'clearElems'),
 ('yy3.segm1.org', 'gxyzinf'),
 ('yy34.segm1.org', 'setFlagsOneWidg'),
 ('yy13.segm1.org', 'x_gbinf')]

Now I need to get a separate DataFrame for each ('host', 'operation') pair. I can do it by iterating over the group keys:

for el in g.groups.keys():
     ...:     print el, 'VALUES', g.groups[el]
     ...:
('yy25.segm1.org', 'x_gbinf') VALUES [10021]
('yy36.segm1.org', 'gxyzinf') VALUES [10074, 10085]
('yy25.segm1.org', 'updateWidg') VALUES [10022]
('yy25.segm1.org', 'gtfull') VALUES [10019]
('yy16.segm1.org', 'gxyzinf') VALUES [10052, 10055, 10062, 10064]
('yy32.segm1.org', 'addWidging2') VALUES [10034]
('yy16.segm1.org', 'notifyTestsDelivered') VALUES [10056, 10065]
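
For comparison, iterating the GroupBy object itself yields (key, sub-DataFrame) pairs, so the index lists in g.groups do not have to be looked up by hand. A minimal sketch, assuming a small made-up frame in place of the real df (only the column names come from the post):

```python
import pandas as pd

# Hypothetical stand-in for the real df; values are invented for illustration.
df = pd.DataFrame({
    "host": ["yy25.segm1.org", "yy36.segm1.org", "yy36.segm1.org", "yy25.segm1.org"],
    "operation": ["x_gbinf", "gxyzinf", "gxyzinf", "gtfull"],
    "log_timestamp": [3.0, 2.0, 1.0, 4.0],
    "time": [0.12, 0.30, 0.25, 0.08],
}, index=[10021, 10074, 10085, 10019])

g = df.groupby(["host", "operation"])

# Iterating a GroupBy yields (key, sub-DataFrame) pairs.
frames = {key: frame for key, frame in g}

# A single group can also be fetched directly by its key.
one = g.get_group(("yy36.segm1.org", "gxyzinf"))
print(one.index.tolist())
```

Here frames maps each (host, operation) tuple to its own DataFrame, which is only worth building if the individual frames are really needed.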

Questions:

Q1. I wonder whether I should split the DataFrameGroupBy object, or is there a faster way to achieve the goal here?

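Strategically: I need to calculate an exponentially weighted moving average and an exponentially weighted standard deviation (although the std dev should decay much more slowly).

To this end I need the data: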

a. grouped by host, operation

b. each host/operation subset sorted by log_timestamp

c. ewma and ewmstd calculated for the time column.

Is there a way of achieving this without splitting the DataFrameGroupBy?

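Q2. The goal is to signal that the time for a particular host/operation has become abnormal over the last several minutes (an overload condition). My idea is to calculate a 'slow ewmstd' and a 'slow ewma' (over a longer period, say 1 hour); the short-term ewma (say 5 minutes) could then be interpreted as an emergency value if it is more than 2 slow std devs away from the slow ewma (three-sigma rule). I'm not even sure this is a correct, let alone the best, approach. Is it?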

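It probably is, since this is roughly similar to how the UNIX 1m, 5m and 15m load averages work: if the 15m average is normal but the 1m load avg is much higher, you know the load is much higher than usual. But I'm not sure about it.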
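
The fast-versus-slow comparison described in Q2 can be sketched for a single host/operation series as below. This is only an illustration under stated assumptions: the data is synthetic, the spans and N_SIGMA are arbitrary, it uses Series.ewm from recent pandas versions, and it lags the slow baseline by LAG samples so that the spike being tested does not inflate its own threshold. That lag is an addition of this sketch, not part of the idea as stated above.

```python
import pandas as pd

# Synthetic response times for one host/operation pair, one sample per minute.
idx = pd.date_range("2013-01-01", periods=120, freq="min")
times = pd.Series(1.0, index=idx)
times.iloc[::2] = 1.2    # regular jitter so the slow std dev is non-zero
times.iloc[-5:] = 5.0    # simulated overload in the last five minutes

# Illustrative values only, expressed in samples (minutes here).
FAST_SPAN, SLOW_SPAN, N_SIGMA, LAG = 5, 60, 2, 5

fast_ewma = times.ewm(span=FAST_SPAN).mean()
slow_ewma = times.ewm(span=SLOW_SPAN).mean()
slow_std = times.ewm(span=SLOW_SPAN).std()

# Assumption of this sketch: compare against the slow statistics from LAG
# samples earlier, so the current burst is not part of its own baseline.
threshold = slow_ewma.shift(LAG) + N_SIGMA * slow_std.shift(LAG)

# NaN thresholds (during warm-up) compare as False, so no alarm is raised there.
alarm = fast_ewma > threshold
print(alarm.tail(6))
```

Without the lag, the slow std dev absorbs the spike within a few samples and the threshold rises almost as fast as the short-term average, which is worth keeping in mind when judging the approach.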

Answer 1

The docs are here

You just need:

def f(x):
    # x is the sub-DataFrame for one (host, operation) group
    return x  # placeholder: replace with a calculation on x

f can also be a lambda: lambda x: ...

df.groupby(['host','operation']).apply(f)
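
For the specific calculation in the question, a per-group function might look like the following. This is a hedged sketch, not the only way to do it: it assumes a recent pandas where Series.ewm(...) replaces the older ewma/ewmstd functions, it uses transform on the time column rather than apply so that the result lines up with the original rows, and both the sample data and SPAN are made up.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "host": ["a", "a", "a", "b", "b"],
    "operation": ["op", "op", "op", "op", "op"],
    "log_timestamp": [3, 1, 2, 1, 2],
    "time": [3.0, 1.0, 2.0, 10.0, 20.0],
})

SPAN = 3  # illustrative smoothing span, in samples

# Sort once up front; groupby keeps the row order within each group.
ordered = df.sort_values("log_timestamp")
by_pair = ordered.groupby(["host", "operation"])["time"]

# transform returns a Series aligned with the rows of `ordered`.
ordered["ewma"] = by_pair.transform(lambda s: s.ewm(span=SPAN).mean())
ordered["ewmstd"] = by_pair.transform(lambda s: s.ewm(span=SPAN).std())

print(ordered.sort_values(["host", "operation", "log_timestamp"]))
```

Each group's first ewma equals its first time value and its first ewmstd is NaN, since a standard deviation needs at least two samples.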
