Question

This is a short, complete example of a more complicated real application.

Libraries used:

import numpy as np
import scipy as sp
import scipy.stats as scist
import matplotlib.pyplot as plt
from itertools import zip_longest

Data:

I have arrays of irregular bin starts and ends, for example like this (in the real case this format is a given, because it is the output of another process):

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

which, combined with

# wrap the iterator in list(): recent NumPy versions reject a bare iterator in np.stack
bns = np.stack(list(zip_longest(bin_starts, bin_ends))).flatten()
bns
>>> array([  0,  89,  93, 178, 184, 272, 277, 363, 368, 458])

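gives a regularly alternating sequence of long and short intervals, all of irregular length. Here is a sketch of the given long and short intervals:

I have a bunch of time series data, similar to the random data created below: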

# make some random example data to bin
np.random.seed(45)
x = np.arange(0,460)
y = 5+np.random.randn(460).cumsum()
plt.plot(x,y);

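Objective:

I want to use the sequence of intervals to collect statistics (mean, percentiles, etc.) on the data, but using only the long intervals, i.e. the yellow ones in the sketch.

Assumptions and clarifications:

The edges of the long intervals never overlap; in other words, there is always a short interval between two long intervals. Also, the first interval is always a long one.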
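These assumptions can be checked directly on the bin arrays. The following is only a sketch; the names long_lengths, gap_lengths, no_overlap and strictly_increasing are made up for illustration, and np.column_stack is used here as one possible way to interleave the starts and ends:

```python
import numpy as np

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# length of each long interval, and of each short gap between two long intervals
long_lengths = bin_ends - bin_starts
gap_lengths = bin_starts[1:] - bin_ends[:-1]

# long intervals never overlap if every gap is strictly positive
no_overlap = bool(np.all(gap_lengths > 0))

# interleave starts and ends; the edges must be strictly increasing to be usable as bins
bns = np.column_stack((bin_starts, bin_ends)).ravel()
strictly_increasing = bool(np.all(np.diff(bns) > 0))

print(long_lengths, gap_lengths, no_overlap, strictly_increasing)
```

If either check fails, the [::2] slicing trick used further down would silently pick the wrong bins.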

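Current solution:

One way is to use scipy.stats.binned_statistic on all intervals, then slice the result to keep only every other one (i.e. [::2]), like so (with great help for some statistics, such as np.percentile, from reading this SO answer by @ali_m):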

ave = scist.binned_statistic(x, y, 
                         statistic = np.nanmean, 
                         bins=bns)[0][::2]

This gives me the desired result:

plt.plot(np.arange(0,5), ave);
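For the percentile case mentioned above, the same slicing approach might look like the sketch below. It assumes that passing a callable as statistic is acceptable for your data size (it is slower than the built-in string statistics); the name p90 and the choice of the 90th percentile are only illustrative:

```python
import numpy as np
import scipy.stats as scist

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])
bns = np.column_stack((bin_starts, bin_ends)).ravel()

# same kind of random example data as in the question
np.random.seed(45)
x = np.arange(0, 460)
y = 5 + np.random.randn(460).cumsum()

# 90th percentile in every bin, then keep only the long (even-indexed) bins
p90 = scist.binned_statistic(
    x, y,
    statistic=lambda values: np.percentile(values, 90),
    bins=bns,
)[0][::2]

print(p90)
```

Note that binned_statistic treats every bin except the last as half-open, so the first long bin here covers x = 0..88 and x = 89 falls into the following short bin.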

QUESTION: is there a more Pythonic way to do this (using any of Numpy, Scipy, or Pandas)?

Answer 1

I think that a combination of IntervalIndex, pd.cut, groupby, and agg is a relatively simple and straightforward way to get what you want.

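I would first make the DataFrame (I don't know whether this is the best way to get there from np arrays):

import pandas as pd

df = pd.DataFrame()
df['x'], df['y'] = x, y

Then you can define your bins as a list of tuples:

bins = list(zip(bin_starts, bin_ends))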

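Use the pandas IntervalIndex, which has a from_tuples() method, to create the bins for later use in cut. This is useful because you don't have to rely on slicing your bns array to untangle the "regularly alternating sequence of long and short intervals"; instead you can explicitly define the bins you are interested in:

ii = pd.IntervalIndex.from_tuples(bins, closed='both')

The closed kwarg states whether the interval includes its end points. For example, for the tuple (0, 89) with closed='both', the interval includes both 0 and 89 (as opposed to left, right, or neither).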
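To see what the four closed options do to the end points, here is a small sketch using just the first two long intervals. The dictionary name membership is made up for illustration:

```python
import pandas as pd

bins = [(0, 89), (93, 178)]

# for each option, record whether the end points 0 and 89 belong to the first interval
membership = {}
for closed in ("both", "left", "right", "neither"):
    ii = pd.IntervalIndex.from_tuples(bins, closed=closed)
    first = ii[0]
    membership[closed] = (0 in first, 89 in first)

for closed, (has_left, has_right) in membership.items():
    print(f"{closed:8s} contains 0: {has_left}, contains 89: {has_right}")
```

With closed='both' every sample that sits exactly on a bin start or bin end is counted in the long interval, which is probably what you want when the starts and ends come from another process as inclusive indices.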

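Then create a category column in the dataframe using pd.cut(), which bins values into intervals. An IntervalIndex object can be passed via the bins kwarg:

df['bin'] = pd.cut(df.x, bins=ii)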

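Finally, use df.groupby() and .agg() to get whatever statistics you want:

df.groupby('bin')['y'].agg(['mean', 'std'])

The output is a DataFrame with one row per bin and one column for each requested statistic (here mean and std).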
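Putting the steps of this answer together, a self-contained sketch could look like this. It reuses the example data from the question; passing observed=True to groupby and adding a count column are my own additions for illustration, not part of the steps above:

```python
import numpy as np
import pandas as pd

bin_starts = np.array([0, 93, 184, 277, 368])
bin_ends = np.array([89, 178, 272, 363, 458])

# same kind of random example data as in the question
np.random.seed(45)
x = np.arange(0, 460)
y = 5 + np.random.randn(460).cumsum()

df = pd.DataFrame({"x": x, "y": y})

# only the long intervals are defined as bins; samples in the short gaps get NaN
ii = pd.IntervalIndex.from_tuples(list(zip(bin_starts, bin_ends)), closed="both")
df["bin"] = pd.cut(df["x"], bins=ii)

# rows with a NaN bin are dropped by groupby, so only long intervals remain
stats = df.groupby("bin", observed=True)["y"].agg(["mean", "std", "count"])
print(stats)
```

One difference from the binned_statistic approach to keep in mind: with closed='both' the sample at x = 89 belongs to the first long interval, whereas binned_statistic uses half-open bins and puts it into the short interval that follows, so the two methods can give slightly different numbers.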

Binned statistics with irregular and alternating bins
