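I have two datasets with different numbers of rows (each row is a record with on the order of 10 columns), and they must end up with the same number of rows. The criterion is to bin on several columns, anywhere from 2 to 4, and then drop the excess records in each bin from one of the two datasets, picking them at random among all records in that bin.
I am currently using numpy, but pandas would be fine too.
Since I know in advance that one dataset is smaller than the other, my (admittedly naive) idea is to compute the two histograms (the smaller dataset first), subtract one from the other to get the difference in each bin, and then walk through the larger dataset deleting the excess records. The catch is that I need to know which record falls in which bin!
Code snippet for computing the histogram in Python (two columns of the dataset, for simplicity):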
import numpy as np
import numpy.random as rd
x = 50*rd.random((100, 5))
np.histogram2d(x[:, 0], x[:, 1], bins=[10, 5])
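Is there a way to keep track of the dataset indices while binning?
I know pandas DataFrames can carry an index, so they could be a natural choice, provided I stick with this algorithm.
Is there a smarter way to do this, changing the algorithm but staying with Python?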
Answer 0 (score: 0)
I found a nice solution using pandas.
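import numpy as np
import pandas as pd
x = 50 * np.random.randn(50, 5)
dfx = pd.DataFrame(x)
# 10 evenly spaced edges give 9 bins on column 0
bins = np.linspace(min(dfx[0]), max(dfx[0]), 10)
first_binning = pd.cut(dfx[0], bins)
# 5 evenly spaced edges give 4 bins on column 1
bins = np.linspace(min(dfx[1]), max(dfx[1]), 5)
second_binning = pd.cut(dfx[1], bins)
# pd.cut intervals are right-closed by default, so the record holding each
# column's minimum falls outside the first bin unless include_lowest=True is passed
groups = dfx.groupby([first_binning, second_binning])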
Now you can run the following (the output depends on your data):
In [160]: groups.size()
Out[160]:
0                    1
(-101.273, -71.403]  (50.481, 109.902]    2
(-71.403, -41.532]   (-68.362, -8.94]     4
                     (-8.94, 50.481]      3
                     (50.481, 109.902]    1
(-41.532, -11.661]   (-68.362, -8.94]     4
                     (-8.94, 50.481]      3
                     (50.481, 109.902]    2
(-11.661, 18.21]     (-127.783, -68.362]  2
                     (-8.94, 50.481]      6
                     (50.481, 109.902]    1
(18.21, 48.0806]     (-127.783, -68.362]  2
                     (-68.362, -8.94]     5
                     (-8.94, 50.481]      3
                     (50.481, 109.902]    3
(48.0806, 77.951]    (-68.362, -8.94]     2
                     (-8.94, 50.481]      4
(77.951, 107.822]    (-68.362, -8.94]     1
dtype: int64
That shows the number of records in each occupied bin. For the records themselves:
In [163]: groups.indices
Out[163]:
{('(-101.273, -71.403]', '(50.481, 109.902]'): array([20, 37]),
('(-11.661, 18.21]', '(-127.783, -68.362]'): array([26, 39]),
('(-11.661, 18.21]', '(-8.94, 50.481]'): array([ 4, 14, 18, 34, 35, 45]),
('(-11.661, 18.21]', '(50.481, 109.902]'): array([17]),
('(-41.532, -11.661]', '(-68.362, -8.94]'): array([ 3, 13, 16, 30]),
('(-41.532, -11.661]', '(-8.94, 50.481]'): array([25, 38, 48]),
('(-41.532, -11.661]', '(50.481, 109.902]'): array([0, 5]),
('(-71.403, -41.532]', '(-68.362, -8.94]'): array([ 1, 24, 32, 47]),
('(-71.403, -41.532]', '(-8.94, 50.481]'): array([ 6, 19, 31]),
('(-71.403, -41.532]', '(50.481, 109.902]'): array([12]),
('(18.21, 48.0806]', '(-127.783, -68.362]'): array([21, 46]),
('(18.21, 48.0806]', '(-68.362, -8.94]'): array([ 2, 15, 22, 33, 40]),
('(18.21, 48.0806]', '(-8.94, 50.481]'): array([ 7, 28, 36]),
('(18.21, 48.0806]', '(50.481, 109.902]'): array([ 9, 23, 49]),
('(48.0806, 77.951]', '(-68.362, -8.94]'): array([41, 42]),
('(48.0806, 77.951]', '(-8.94, 50.481]'): array([27, 29, 43, 44]),
('(77.951, 107.822]', '(-68.362, -8.94]'): array([11])}
This gives the dataset record indices that fall in each bin.
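With the per-bin record indices available, the random trimming described in the question could be sketched roughly as follows. This is only one possible approach, not a tested recipe for the asker's data. The helper names `bin_codes` and `trim_to_match`, the column names, and the choice to bin both datasets on a shared grid spanning their combined range are all assumptions made for illustration; it also assumes at least two binning columns, as in the question.

```python
import numpy as np
import pandas as pd


def bin_codes(df, columns, edges):
    """Integer bin code per binned column, one row per record of df (hypothetical helper)."""
    return pd.DataFrame(
        {
            # include_lowest=True keeps records lying exactly on the first edge
            col: pd.cut(df[col], edges[col], labels=False, include_lowest=True)
            for col in columns
        },
        index=df.index,
    )


def trim_to_match(large, small, columns, n_bins, seed=0):
    """Randomly drop rows of `large` so that no bin holds more rows than in `small`."""
    # Assumption: shared edges, so both datasets are binned on the same grid
    both = pd.concat([large[columns], small[columns]])
    edges = {
        col: np.linspace(both[col].min(), both[col].max(), n + 1)
        for col, n in zip(columns, n_bins)
    }
    # Records per bin in the smaller dataset, keyed by a tuple of bin codes
    small_counts = bin_codes(small, columns, edges).groupby(columns).size().to_dict()
    rng = np.random.default_rng(seed)
    keep = []
    # .groups maps each occupied bin to the index labels of the rows inside it
    for key, rows in bin_codes(large, columns, edges).groupby(columns).groups.items():
        target = min(len(rows), small_counts.get(key, 0))
        keep.extend(rng.choice(rows, size=target, replace=False))
    return large.loc[sorted(keep)]


# Illustrative random data; the column names are made up for this sketch
rng = np.random.default_rng(1)
large = pd.DataFrame(50 * rng.random((200, 5)), columns=list("abcde"))
small = pd.DataFrame(50 * rng.random((80, 5)), columns=list("abcde"))
trimmed = trim_to_match(large, small, columns=["a", "b"], n_bins=[10, 5])
print(len(large), len(small), len(trimmed))
```

Note that this only removes records from the larger dataset. If some bin holds fewer records in the larger dataset than in the smaller one, the two results will still differ in that bin, and the smaller dataset would have to be trimmed there as well to get exactly equal row counts.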