How to count how many points are "better" than other points in pandas dataframe?

时间:2016-10-20 13:13:12

标签: python pandas dataframe

I have a dataframe in pandas which look something like this:

>>> df[1:3]
     0      1      2     3     4    5     6     7     8
1 -0.59  -99.0  924.0  20.1   5.0  4.0  57.0  19.0   8.0
2 -1.30 -279.0  297.0  16.1  30.0  4.4  63.0  19.0  10.0

The number of points in the dataframe is ~1000. Given a set of columns, I want to find out how many time each point is "better" than the other? Given a set of n columns, a point is better than other point, if it better in at least one of the columns and equal in other columns. A point which is better in one column and worse in n-1 is not considered better because its better than the other point in at least one column.

Edit1: Example:

>>> df
     0      1      2  
1 -0.59  -99.0   924.0 
2 -1.30  -279.0  297.0 
3  2.00  -100.0  500.0
4  0.0   0.0     0.0 

If we consider only column 0, then the result would be:

1 - 1
2 - 0
3 - 3
4 - 2

because point 1 (-0.59) is only better than point 2 with respect to column 1.

Another example by taking columns - 0 and 1:

1 - 1 (only for point 2 all values i.e. column 0 and column 1 are either smaller than point 1 or lesser)
2 - 0 (since no point is has any lesser than this in any dimension)
3 - 1 (point 2)
4 - 2 (point 1 and 2)

Edit 2: Perhaps, something like a function which when given a dataframe, a point (index of point) and a set of columns could give the count as - for each subset of columns how many times that point is better than other points.

def f(p, df, c):
    """returns
       A list : L = [(c1,n), (c2,m)..]
       where c1 is a proper subset of c and n is the number of times that this point was better than other points."""

1 个答案:

答案 0 :(得分:1)

分别对每一列进行排名 通过对每一列进行排名,我可以准确地看到该列中的特定行中有多少其他行大于。

d1 = df.rank().sub(1)
d1

enter image description here

要解决您的问题,逻辑上必须是这样的情况:对于特定行,行元素中的最小等级恰好是该行中每个元素都大于的其他行的数量。

对于前两列[0, 1],可以通过取d1

的最小值来计算

我使用它作为参考来比较原始的前两列与等级

pd.concat([df.iloc[:, :2], d1.iloc[:, :2]], axis=1, keys=['raw', 'ranked'])

enter image description here

如上所述拿分钟。

d1.iloc[:, :2].min(1)

1    1.0
2    0.0
3    1.0
4    2.0
dtype: float64

将结果放在原始数据和排名旁边,以便我们可以看到它

pd.concat([df.iloc[:, :2], d1.iloc[:, :2], d1.iloc[:, :2].min(1)],
          axis=1, keys=['raw', 'ranked', 'results'])

enter image description here

果然,这与你的预期结果有关。