I don't understand the median of medians algorithm

Time: 2018-09-22 22:14:09

Tags: algorithm median-of-medians

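There is something I don't understand about the median of medians algorithm. A key step of the algorithm is finding an approximate median, and according to Wikipedia we are guaranteed that this approximate median is greater than 30% of the elements of the initial set.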

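To find this approximate median, we compute the median of each group of 5 elements, gather these medians into a new set, and repeat until the resulting set contains at most 5 elements. At that point we take the median of that set. (If my explanation isn't clear, see the Wikipedia page.)

However, consider the following set of 125 elements: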

1 2 3 1001 1002
4 5 6 1003 1004
7 8 9 1005 1006
1020 1021 1022 1023 1034 
1025 1026 1027 1028 1035 

10 11 12 1007 1008
13 14 15 1009 1010
16 17 18 1011 1013
1029 1030 1031 1032 1033 
1036 1037 1038 1039 1040 

19 20 21 1014 1015
22 23 24 1016 1017
25 26 27 1018 1019
1041 1042 1043 1044 1045
1046 1047 1048 1049 1050

1051 1052 1053 1054 1055
1056 1057 1058 1059 1060
1061 1062 1063 1064 1065
1066 1067 1068 1069 1070
1071 1072 1073 1074 1075

1076 1077 1078 1079 1080
1081 1082 1083 1084 1085
1086 1087 1088 1089 1090
1091 1092 1093 1094 1095
1096 1097 1098 1099 1100 

So we split the set into groups of 5 elements, compute and gather the medians, and obtain the following set:

3 6 9 1022 1027
12 15 18 1031 1038
21 24 27 1043 1048
1053 1058 1063 1068 1073
1078 1083 1088 1093 1098

We apply the same procedure again and obtain the following set:

9 18 27 1063 1088

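So the approximate median we obtain is 27. But this number is greater than or equal to only 27 elements, and 27/125 = 21.6% < 30%!

So my question is: where did I go wrong? Why, in my case, is the approximate median not greater than 30% of the elements?

Thank you for your replies!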

2 answers:

Answer 0: (score: 3)

I'm completely with your analysis up through the point where you get the medians of each of the blocks of five elements, when you're left with this collection of elements:

3 6 9 1022 1027 12 15 18 1031 1038 21 24 27 1043 1048 1053 1058 1063 1068 1073 1078 1083 1088 1093 1098

You are correct that, at this point, we need to get the median of this collection of elements. However, the way that the median-of-medians algorithm accomplishes this is different than what you've proposed.

When you were working through your analysis, you attempted to get the median of this set of values by, once again, splitting the input into blocks of size five and taking the median of each. However, that approach won't actually give you the median of the medians. (You can see this by noting that you got back 27, which isn't the true median of that collection of values).

The way that the median-of-medians algorithm actually gets back the median of the medians is by recursively invoking the overall algorithm to obtain the median of those elements. This is subtly different from just repeatedly breaking things apart into blocks and computing the medians of each block. In particular, each recursive call will

  • get an estimate of the pivot by using the groups-of-five heuristic,
  • recursively invoke the function on itself to find the median of those medians, then
  • apply a partitioning step on that median and use that to determine how to proceed from there.

This algorithm is, in my opinion, something that's way too complicated to actually trace through by hand. You really need to trust that, since each recursive call you're making works on a smaller array than what you started with, each recursive call will indeed do what it says to do. So when you're left with the medians of each group, as you were before, you should just trust that when you need to get the median by a recursive call, you end up with the true median.

If you look at the true median of the medians that you've generated in the first step, you'll find that it indeed will be between the 30th and 70th percentiles of the original data set.
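As a quick check of that claim on the question's data, here is a minimal Python sketch. It assumes the 125 values are 1 to 27 plus 1001 to 1100 without 1012 and 1024 (read off the table in the question), and it uses `sorted()` as a stand-in for the recursive selection. The names `data`, `medians` and `rank_fraction` are made up for illustration.

```python
# Assumption: the question's 125 values, read off its table:
# 1..27 plus 1001..1100 with 1012 and 1024 absent.
data = list(range(1, 28)) + [v for v in range(1001, 1101) if v not in (1012, 1024)]

# The 25 group-of-five medians from the first step of the question.
medians = [3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27,
           1043, 1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098]


def rank_fraction(value, values):
    """Fraction of values that are strictly smaller than value."""
    return sum(v < value for v in values) / len(values)


# sorted() stands in for the recursive selection call here.
true_median_of_medians = sorted(medians)[len(medians) // 2]

print(true_median_of_medians)                       # 1038
print(rank_fraction(true_median_of_medians, data))  # 0.496
print(rank_fraction(27, data))                      # 0.208
```

The value 1038 sits comfortably between the 30th and 70th percentiles, while 27 (the result of regrouping a second time) does not.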

If this seems confusing, don't worry - you're in really good company. This algorithm is famously tricky to understand. For me, the easiest way to understand it is to just trust that recursion works and to trace through it only one layer deep, working under the assumption that all the recursive calls work, rather than trying to walk all the way down to the bottom of the recursion tree.

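Answer 1: (score: 3)

The reason you're confused about the median-of-medians algorithm is that, while median-of-medians returns an approximate result within 20% of the actual median, at some stages of the algorithm we also need to calculate exact medians. If you mix up the two, you won't get the expected result, as your example demonstrates.

Median-of-medians uses three functions as its building blocks: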

medianOfFive(array, first, last) {
    // ...
    return median;
}

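This function returns the exact median of five (or fewer) elements from (part of) an array. There are several ways to code this, based on e.g. a sorting network or insertion sort. The details aren't important for this question, but it is important to note that this function returns an exact median, not an approximation.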
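For concreteness, one possible version based on insertion sort could look like the following Python sketch. Treating `last` as inclusive and returning the lower of the two middle values for an even count are assumptions made here, not something the outline above fixes.

```python
def median_of_five(array, first, last):
    """Return the exact median of array[first..last] (inclusive, at most five items).

    Assumption: for an even number of items the lower middle value is returned.
    """
    window = array[first:last + 1]  # work on a copy so the input is left untouched
    assert 1 <= len(window) <= 5
    # Insertion sort: cheap for such a tiny window.
    for i in range(1, len(window)):
        item = window[i]
        j = i - 1
        while j >= 0 and window[j] > item:
            window[j + 1] = window[j]
            j -= 1
        window[j + 1] = item
    return window[(len(window) - 1) // 2]


print(median_of_five([1025, 1026, 1027, 1028, 1035], 0, 4))  # 1027
print(median_of_five([1038, 1043, 1048, 1053], 0, 3))        # 1043
```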

medianOfMedians(array, first, last) {
    // ...
    return median;
}

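This function returns an approximation of the median of (part of) an array, which is guaranteed to be larger than the 30% smallest elements and smaller than the 30% largest elements. We'll go into more detail below.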
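A short counting argument shows where the 30% figure comes from. This is only a sketch for the case where n is a multiple of 5 and all values are distinct; the symbols n, m and p are introduced here for illustration.

```latex
\begin{align*}
m &= n/5
  && \text{groups of five, hence } m \text{ medians} \\
\#\{\text{medians} \le p\} &\ge \lceil m/2 \rceil
  && p \text{ is the exact median of the } m \text{ medians} \\
\#\{\text{elements} \le p\} &\ge 3\,\lceil m/2 \rceil \ge \tfrac{3n}{10}
  && \text{each such group contributes its median and two smaller elements} \\
n = 125:\quad 3\,\lceil 25/2 \rceil &= 39, \qquad 39/125 = 31.2\%
\end{align*}
```

By symmetry the same bound holds for the elements greater than or equal to p. The bound depends on p being the exact median of the medians: in the question's example only 9 of the 25 medians are at most 27, which gives 3 · 9 = 27 elements, or 21.6%.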

select(array, first, last, n) {
    // ...
    return element;
}

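This function returns the n-th smallest element from (part of) an array. This function also returns an exact result, not an approximation.

At its most basic, the overall algorithm works like this: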

medianOfMedians(array, first, last) {
    call medianOfFive() for every group of five elements
    fill an array with these medians
    call select() for this array to find the middle element
    return this middle element (i.e. the median of medians)
}

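This is where your calculation went wrong. After creating the array with the medians of five, you used the median-of-medians function again on this array, which gives you an approximation of the median (27), but here you need the actual median (1038).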
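The difference is easy to reproduce. The Python sketch below runs the question's procedure (regrouping until five or fewer values remain) on the 25 medians and compares it with the exact median; `repeated_group_medians` is a hypothetical name for that procedure, and `sorted()` stands in for select().

```python
def median_of_five(values):
    """Exact (lower) median of at most five values."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def repeated_group_medians(values):
    """The question's procedure: regroup and take medians until at most five values remain."""
    while len(values) > 5:
        values = [median_of_five(values[i:i + 5]) for i in range(0, len(values), 5)]
    return median_of_five(values)


# The 25 medians of five from the question's example.
medians = [3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27,
           1043, 1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098]

print(repeated_group_medians(medians))     # 27   (only an approximation)
print(sorted(medians)[len(medians) // 2])  # 1038 (the exact median that is needed)
```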

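This all sounds fairly straightforward, but where it gets complicated is that the function select() calls medianOfMedians() to get a first estimate of the median, which it then uses to calculate the exact median. So you get two-way recursion, where two functions call each other. This recursion stops when medianOfMedians() is called for 25 elements or fewer, because then there are only 5 medians, and instead of using select() to find their median, it can use medianOfFive().

The reason select() calls medianOfMedians() is that it uses partitioning to split (the range of) the array into two parts of roughly equal size, and it needs a good pivot value for that. After it has partitioned the array into two parts, with the elements smaller and larger than the pivot, it checks which part contains the n-th smallest element and recurses on that part. If the part with the smaller values has size n-1, the pivot is the n-th value and no further recursion is needed.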

select(array, first, last, n) {
    call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    call select() on the part which contains the n-th element
}

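As you can see, the select() function recurses (unless the pivot happens to be the n-th element), but on ever smaller ranges of the array, so at some point (e.g. two elements) finding the n-th element becomes trivial and no further recursion is needed.

So finally we get, in some more detail: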

medianOfFive(array, first, last) {
    // some algorithmic magic ...
    return median;
}

medianOfMedians(array, first, last) {
    if 5 elements or fewer, call medianOfFive() and return result
    call medianOfFive() for every group of five elements
    store the results in an array medians[]
    if medians[] has 5 elements or fewer, call medianOfFive() on it and return result
    call select(medians[]) to find the middle element
    return the result (i.e. the median of medians)
}

select(array, first, last, n) {
    if 2 elements, compare and return n-th element
    if 5 elements or fewer, call medianOfFive() to get median as pivot
    else call medianOfMedians() to get approximate median as pivot
    partition (the range of) the array into smaller and larger than pivot
    if part with smaller elements is size n-1, return pivot
    if n-th value is in part with larger values, recalculate value of n
    call select() on the part which contains the n-th element
}
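The outline above can be turned into a short runnable sketch. The Python version below is a simplification made for illustration, not the answer's exact design: it copies lists instead of partitioning a range in place (so there are no `first`/`last` parameters), it sorts directly once five or fewer values remain, and it counts values equal to the pivot so that duplicates are handled. `n` is 1-based.

```python
def median_of_five(values):
    """Exact (lower) median of at most five values."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def median_of_medians(values):
    """Approximate median: the exact median of the group-of-five medians."""
    if len(values) <= 5:
        return median_of_five(values)
    medians = [median_of_five(values[i:i + 5]) for i in range(0, len(values), 5)]
    if len(medians) <= 5:
        return median_of_five(medians)
    # Exact median of the medians, found by select() rather than by regrouping.
    return select(medians, (len(medians) + 1) // 2)


def select(values, n):
    """Return the n-th smallest value (n is 1-based)."""
    if len(values) <= 5:
        return sorted(values)[n - 1]
    pivot = median_of_medians(values)
    smaller = [v for v in values if v < pivot]
    larger = [v for v in values if v > pivot]
    equal_count = len(values) - len(smaller) - len(larger)
    if n <= len(smaller):
        return select(smaller, n)
    if n <= len(smaller) + equal_count:
        return pivot
    # The n-th value is among the larger ones: recalculate n for that part.
    return select(larger, n - len(smaller) - equal_count)


# Assumption: the question's 125 values (1..27 plus 1001..1100 without 1012 and 1024),
# here in sorted order rather than in the question's group layout.
data = list(range(1, 28)) + [v for v in range(1001, 1101) if v not in (1012, 1024)]
print(select(data, 63))  # 1038
```

Because the pivot is always an element of the list, both `smaller` and `larger` are strictly shorter than the input, so the mutual recursion terminates.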

Example

Input array (125 values, 25 groups of five):

 #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   #14   #15   #16   #17   #18   #19   #20   #21   #22   #23   #24   #25

   1     4     7  1020  1025    10    13    16  1029  1036    19    22    25  1041  1046  1051  1056  1061  1066  1071  1076  1081  1086  1091  1096
   2     5     8  1021  1026    11    14    17  1030  1037    20    23    26  1042  1047  1052  1057  1062  1067  1072  1077  1082  1087  1092  1097
   3     6     9  1022  1027    12    15    18  1031  1038    21    24    27  1043  1048  1053  1058  1063  1068  1073  1078  1083  1088  1093  1098
1001  1003  1005  1023  1028  1007  1009  1011  1032  1039  1014  1016  1018  1044  1049  1054  1059  1064  1069  1074  1079  1084  1089  1094  1099
1002  1004  1006  1034  1035  1008  1010  1013  1033  1040  1015  1017  1019  1045  1050  1055  1060  1065  1070  1075  1080  1085  1090  1095  1100

Medians of groups of five (25 values):

3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27, 1043,  
1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

Groups of five for the approximate median:

 #1    #2    #3    #4    #5

   3    12    21  1053  1078
   6    15    24  1058  1083
   9    18    27  1063  1088
1022  1031  1043  1068  1093
1027  1038  1048  1073  1098

Medians of five for the approximate median:

9, 18, 27, 1063, 1088

Approximate median as pivot:

27

Medians of five partitioned with pivot 27 (depends on method):

small: 3, 6, 9, 24, 21, 12, 15, 18
pivot: 27
large: 1031, 1038, 1027, 1022, 1043, 1048, 1053, 1058,  
       1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

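The smaller group has 8 elements and the larger group has 16. We were looking for the middle element, the 13th out of 25, so now we look for the 13 - 8 - 1 = 4th element out of 16: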
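The bookkeeping in this step can be sketched in a few lines of Python. The sketch assumes distinct values, and the names `target` and `n_next` are made up for illustration.

```python
# The 25 medians of five from the example.
medians = [3, 6, 9, 1022, 1027, 12, 15, 18, 1031, 1038, 21, 24, 27,
           1043, 1048, 1053, 1058, 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098]
pivot, n = 27, 13  # approximate median as pivot; we want the 13th smallest of 25

smaller = [v for v in medians if v < pivot]
larger = [v for v in medians if v > pivot]

if n <= len(smaller):
    target, n_next = smaller, n                     # keep looking among the smaller values
elif n == len(smaller) + 1:
    target, n_next = [pivot], 1                     # the pivot itself is the n-th value
else:
    target, n_next = larger, n - len(smaller) - 1   # skip the smaller values and the pivot

print(len(smaller), len(larger), n_next)  # 8 16 4
```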

Groups of five:

 #1    #2    #3    #4

1031  1048  1073  1098
1038  1053  1078
1027  1058  1083
1022  1063  1088
1043  1068  1093

Medians of groups of five:

1031, 1058, 1083, 1098

Approximate median as pivot:

1058

Range of medians of five partitioned with pivot 1058 (depends on method):

small: 1031, 1038, 1027, 1022, 1043, 1048, 1053
pivot: 1058
large: 1063, 1068, 1073, 1078, 1083, 1088, 1093, 1098

The smaller group has 7 elements. We were looking for the 4th element out of 16, so now we look for the 4th element out of 7:

Groups of five:

 #1    #2

1031  1048
1038  1053
1027
1022
1043

Medians of groups of five:

1031, 1048

Approximate median as pivot:

1031

Range of medians of five partitioned with pivot 1031 (depends on method):

small: 1022, 1027
pivot: 1031
large: 1038, 1043, 1048, 1053

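The smaller part has 2 elements and the larger part has 4, so now we look for the 4 - 2 - 1 = 1st element out of 4:

Median of five as pivot:

1043

Range of medians of five partitioned with pivot 1043 (depends on method):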

small: 1038
pivot: 1043
large: 1048, 1053

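The smaller part has only one element, and we're looking for the first element, so we can return the small element 1038.

As you can see, 1038 is the exact median of the original 25 medians of five, and there are 62 smaller values in the original array of 125:

1 ~ 27, 1001 ~ 1011, 1013 ~ 1023, 1025 ~ 1037

This not only puts it in the 30% to 70% range, but also means that it is in fact the exact median.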