python: removing records from a dataset by comparing two histograms

Asked: 2017-02-04 14:50:29

Tags: python-3.x pandas numpy histogram

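I have two multi-column datasets (on the order of 10 columns; each row is a record) of different lengths, and they must end up with the same number of rows. The criterion is to bin on several columns, anywhere from 2 to 4 of them, and then delete the excess records in each bin from one of the two datasets (choosing at random among all the records in that bin).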

I am currently using numpy, but I could also use pandas.

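Since I know in advance that one dataset is smaller than the other, my (admittedly naive) idea is to compute the two histograms (the smaller dataset first), subtract one from the other to get the difference in each bin, and then go through the larger dataset deleting the excess records. But for that I have to know which record falls in which bin!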

Snippet for computing the histogram in Python (a two-column dataset, for simplicity):

import numpy as np
import numpy.random as rd
x = 50*rd.random((100, 5))
np.histogram2d(x[:, 0], x[:, 1], bins=[10, 5])

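Is there a way to keep track of the dataset indices while binning? I know pandas DataFrames can carry an index, so they could be a natural choice, provided I stick with this algorithm.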

Is there a smarter way to do this, changing the algorithm but sticking with Python?

1 answer:

Answer 0: (score: 0)

I found a nice solution using pandas.

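import numpy as np
import pandas as pd
x = 50 * np.random.randn(50, 5)
dfx = pd.DataFrame(x)
# Bin column 0: 10 edges give 9 right-closed intervals (lo, hi]
bins = np.linspace(dfx[0].min(), dfx[0].max(), 10)
first_binning = pd.cut(dfx[0], bins)
# Bin column 1: 5 edges give 4 intervals
bins = np.linspace(dfx[1].min(), dfx[1].max(), 5)
second_binning = pd.cut(dfx[1], bins)
# Group the records by their (column 0 bin, column 1 bin) pair
groups = dfx.groupby([first_binning, second_binning])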
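One caveat about the binning above: `pd.cut` builds right-closed intervals by default, so a value exactly equal to the first edge (here the column minimum) falls in no bin, becomes NaN, and is silently left out of the groupby. The following is a minimal sketch of that effect and of the `include_lowest=True` option; the seed and the names `values`, `edges`, `default_cut` and `closed_cut` are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
values = pd.Series(50 * rng.standard_normal(50))
edges = np.linspace(values.min(), values.max(), 10)

# Default: intervals are (lo, hi], so the minimum itself gets no bin (NaN)
default_cut = pd.cut(values, edges)
# include_lowest=True makes the first interval include its left edge
closed_cut = pd.cut(values, edges, include_lowest=True)

n_unbinned_default = int(default_cut.isna().sum())
n_unbinned_closed = int(closed_cut.isna().sum())
print(n_unbinned_default, n_unbinned_closed)
```

If every record has to land in a bin, passing `include_lowest=True` (or choosing edges slightly wider than the data range) is one way to get that.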

Now you can run the following (the output depends on your data):

In [160]: groups.size()
Out[160]:
0                    1
(-101.273, -71.403]  (50.481, 109.902]      2
(-71.403, -41.532]   (-68.362, -8.94]       4
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      1
(-41.532, -11.661]   (-68.362, -8.94]       4
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      2
(-11.661, 18.21]     (-127.783, -68.362]    2
                     (-8.94, 50.481]        6
                     (50.481, 109.902]      1
(18.21, 48.0806]     (-127.783, -68.362]    2
                     (-68.362, -8.94]       5
                     (-8.94, 50.481]        3
                     (50.481, 109.902]      3
(48.0806, 77.951]    (-68.362, -8.94]       2
                     (-8.94, 50.481]        4
(77.951, 107.822]    (-68.362, -8.94]       1
dtype: int64

to look at the counts per bin, and

In [163]: groups.indices
Out[163]:
{('(-101.273, -71.403]', '(50.481, 109.902]'): array([20, 37]),
 ('(-11.661, 18.21]', '(-127.783, -68.362]'): array([26, 39]),
 ('(-11.661, 18.21]', '(-8.94, 50.481]'): array([ 4, 14, 18, 34, 35, 45]),
 ('(-11.661, 18.21]', '(50.481, 109.902]'): array([17]),
 ('(-41.532, -11.661]', '(-68.362, -8.94]'): array([ 3, 13, 16, 30]),
 ('(-41.532, -11.661]', '(-8.94, 50.481]'): array([25, 38, 48]),
 ('(-41.532, -11.661]', '(50.481, 109.902]'): array([0, 5]),
 ('(-71.403, -41.532]', '(-68.362, -8.94]'): array([ 1, 24, 32, 47]),
 ('(-71.403, -41.532]', '(-8.94, 50.481]'): array([ 6, 19, 31]),
 ('(-71.403, -41.532]', '(50.481, 109.902]'): array([12]),
 ('(18.21, 48.0806]', '(-127.783, -68.362]'): array([21, 46]),
 ('(18.21, 48.0806]', '(-68.362, -8.94]'): array([ 2, 15, 22, 33, 40]),
 ('(18.21, 48.0806]', '(-8.94, 50.481]'): array([ 7, 28, 36]),
 ('(18.21, 48.0806]', '(50.481, 109.902]'): array([ 9, 23, 49]),
 ('(48.0806, 77.951]', '(-68.362, -8.94]'): array([41, 42]),
 ('(48.0806, 77.951]', '(-8.94, 50.481]'): array([27, 29, 43, 44]),
 ('(77.951, 107.822]', '(-68.362, -8.94]'): array([11])}

to look at the dataset record indices in each bin, of course.
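With the record indices available per bin, the trimming step described in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: both datasets are binned with shared edges, random records are dropped from the larger one wherever it has more records than the smaller one, and bins where the larger dataset already has fewer records are left alone (so the final row counts match only if the larger dataset dominates in every bin). The names `small`, `large`, `bin_cols`, `n_bins`, `bin_codes` and `trimmed` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable
small = pd.DataFrame(50 * rng.random((100, 5)))  # stand-in for the smaller dataset
large = pd.DataFrame(50 * rng.random((300, 5)))  # stand-in for the larger dataset

bin_cols = [0, 1]        # columns to bin on (assumed)
n_bins = {0: 10, 1: 5}   # number of bins per column (assumed)

# Shared edges so that both datasets are binned identically
both = pd.concat([small, large])
edges = {c: np.linspace(both[c].min(), both[c].max(), n_bins[c] + 1)
         for c in bin_cols}
code_cols = [f"bin_{c}" for c in bin_cols]


def bin_codes(df):
    """Integer bin code per record and binned column, keeping df's index."""
    return pd.DataFrame(
        {f"bin_{c}": pd.cut(df[c], edges[c], labels=False, include_lowest=True)
         for c in bin_cols},
        index=df.index,
    )


small_bins = bin_codes(small)
large_bins = bin_codes(large)
small_counts = small_bins.groupby(code_cols).size()
large_counts = large_bins.groupby(code_cols).size()

# Excess of the larger dataset in each bin (never negative)
excess = large_counts.sub(small_counts, fill_value=0).clip(lower=0).astype(int)

# Pick that many record labels at random within each bin and drop them
to_drop = []
for key, labels in large_bins.groupby(code_cols).groups.items():
    n_excess = int(excess.get(key, 0))
    if n_excess > 0:
        to_drop.extend(rng.choice(labels.to_numpy(), size=n_excess, replace=False))

trimmed = large.drop(index=to_drop)
print(len(small), len(large), len(trimmed))
```

After this, each bin of `trimmed` holds at most as many records as the same bin of `small`. Bins where `large` started with fewer records would need separate handling, for example by trimming `small` there as well.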