Question

For example, here is the data table:

1.1       300 
1.5       200
1.7       234
2.4       356
2.8       234
3.4       456

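I want to put the values in the second column into intervals according to the first column: the first three rows go into the 1.0-2.0 interval, the next two into the 2.0-3.0 interval, and the last one into the 3.0-4.0 interval. In addition, within each interval I would like to return the value that is greater than the bottom 90% of the values but smaller than the top 10% of the values in that interval (assume that in the real data each interval contains many numbers).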

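The output I want is a new table with 2 columns: column 1 is the midpoint of the interval boundaries, and column 2 is the value described in the previous paragraph. For the sample data table the output is: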

1.5    300
2.5    356 
3.5    456

Thanks!

Answer 1

Is this what you want?

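import warnings

import numpy as np


def bin_data(x, y, bins=(1., 2., 3., 4.)):
    """Bin y by the value of x and return the 90th percentile of each bin.

    Returns (xsm, ysm): the midpoint of each bin and the 90th percentile
    of the y values whose x falls in that bin.
    """
    # Boolean masks are used below, so make sure both inputs are arrays
    x = np.asarray(x)
    y = np.asarray(y)

    bins_number = len(bins) - 1
    xsm = np.mean([bins[:-1], bins[1:]], axis=0)  # midpoint of each bin
    ysm = np.zeros(bins_number)

    #-----------
    # The following process is what actually bins the data using numpy
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=RuntimeWarning)
        for i in range(bins_number):
            if i == bins_number - 1:
                # The last bin has no upper limit: it takes every x >= bins[i]
                sel = bins[i] <= x
            else:
                sel = (bins[i] <= x) & (x < bins[i+1])
            # NumPy >= 1.22 names this keyword `method`; on older releases
            # write interpolation='nearest' instead.
            ysm[i] = np.percentile(y[sel], 90, method='nearest')
    #-----------

    return xsm, ysm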
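The function expects the two columns as separate arrays. As a minimal sketch of one way to get them, the sample table from the question can be parsed with np.loadtxt; the inline string `text` is only a stand-in for however the real data is stored, and the names `x` and `y` are chosen to match the call shown in this answer.

```python
import io

import numpy as np

# The sample table from the question, one row per line (stand-in for a real file).
text = (
    "1.1 300\n"
    "1.5 200\n"
    "1.7 234\n"
    "2.4 356\n"
    "2.8 234\n"
    "3.4 456\n"
)

# unpack=True returns one array per column instead of one array per row.
x, y = np.loadtxt(io.StringIO(text), unpack=True)

print(x)
print(y)
```

With a file on disk, the path can be passed to np.loadtxt in place of the StringIO object.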

The output is now correct:

In [25]: bin_data(x, y)
Out[25]: (array([ 1.5,  2.5,  3.5]), array([ 300.,  356.,  456.]))
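If the explicit masks feel verbose, the bin assignment could probably also be done with np.digitize. The following is only a sketch under a few assumptions, not a drop-in replacement: the function name `bin_percentile` is made up for illustration, values outside the outermost edges are ignored (unlike the open-ended last bin above), empty bins yield NaN, and the percentile is taken by picking the sorted element whose rank is closest to q percent of the way through the bin rather than by calling np.percentile.

```python
import numpy as np


def bin_percentile(x, y, edges, q=90):
    """Return (bin midpoints, q-th percentile of y per bin); empty bins give NaN."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.asarray(edges, dtype=float)

    # np.digitize returns i such that edges[i-1] <= x < edges[i];
    # subtracting 1 makes bin 0 the half-open interval [edges[0], edges[1]).
    idx = np.digitize(x, edges) - 1

    mids = (edges[:-1] + edges[1:]) / 2
    out = np.full(len(mids), np.nan)
    for i in range(len(mids)):
        vals = np.sort(y[idx == i])
        if vals.size:
            # Index of the sorted value closest to q percent of the way through.
            k = int(round(q / 100 * (vals.size - 1)))
            out[i] = vals[k]
    return mids, out


x = [1.1, 1.5, 1.7, 2.4, 2.8, 3.4]
y = [300, 200, 234, 356, 234, 456]
mids, p90 = bin_percentile(x, y, [1.0, 2.0, 3.0, 4.0])
print(mids)
print(p90)
```

On the sample data this gives the midpoints 1.5, 2.5, 3.5 and the values 300, 356, 456, matching the table requested in the question.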
