Continuously updating the median + space efficiency

Time: 2019-06-16 17:59:16

Tags: algorithm mean space pseudocode median

Maybe I am not searching with the right keywords (I could not find a solution).

I am trying to compute the median of a list of numbers in a space-efficient way, updating it continuously as new numbers arrive.

For the mean there is a nice approach: store the number of elements seen so far and weight the old mean accordingly. For example (pseudocode):

// Initialize values
noList   = [8,10,4,6]
mean     = 0
noItems  = 0

// Now we want to update the mean continually with further values.
for (value : noList) {
  mean    = (noItems / (noItems + 1)) * mean + (1 / (noItems + 1)) * value
  noItems = noItems + 1
}

// After iteration 1: wholeList = [8]       ; mean = 8   ; noItems = 1
// After iteration 2: wholeList = [8,10]    ; mean = 9   ; noItems = 2
// After iteration 3: wholeList = [8,10,4]  ; mean = 7.33; noItems = 3
// After iteration 4: wholeList = [8,10,4,6]; mean = 7   ; noItems = 4
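For reference, a minimal runnable sketch of the same idea in Python. The function name `running_mean` is only illustrative; the update is algebraically the same weighting as in the pseudocode above, written in a single step.

```python
def running_mean(values):
    """Yield the mean after each value, storing only the mean and a count."""
    mean = 0.0
    count = 0
    for value in values:
        count += 1
        # Same as ((count - 1) / count) * mean + (1 / count) * value.
        mean += (value - mean) / count
        yield mean


# The four intermediate means for the list used in the pseudocode.
means = list(running_mean([8, 10, 4, 6]))
```

Only two numbers are kept between updates, no matter how long the input is.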

Question: Is there a similar (space-efficient) way to compute the median?

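Update: I have updated the question (thanks to @WillemVanOnsem). I am looking not only for a way to update the median continuously, but also for one that is space-efficient. Following his hint, we can maintain two data structures.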

Example:

// 1) We have a list for which we want to find the median.
noList   = [9,10,4,6,13,12]

// 2) We divide it into two lists or data structures (and additionally sort them).
smallerList = [4,6,9]
biggerList  = [10,12,13]

// 3) Both lists have the same length, so the median lies between the last element of smallerList and the first element of biggerList.
median = (9 + 10) / 2 = 9.5

// 4) Next, we add a further element and want to update our median.
// We add the number 5 to our data structures. So the new list is:
noList   = [9,10,4,6,13,12,5]

// 5) Obviously 5 is smaller than our current median of 9.5. So we insert it in a sorted way into smallerList:
smallerList = [4,5,6,9]
biggerList  = [10,12,13]

// 6) Now length(smallerList) > length(biggerList), so we know that the updated median is the last element of smallerList.
median = 9

// 7) Next, we add a further element and want to update our median.
// We add the number 2 to our data structures. So the new list is:
noList   = [9,10,4,6,13,12,5,2]

// 8) Obviously 2 is smaller than our current median of 9. So we insert it again in a sorted way into smallerList:
smallerList = [2,4,5,6,9]
biggerList  = [10,12,13]

// 9) Now smallerList is more than one element longer than biggerList, so we need to "balance" the lists by moving one element from one list to the other.
// We remove the element 9 from smallerList and insert it into biggerList.
smallerList = [2,4,5,6]
biggerList  = [9,10,12,13]

// 10) Both lists have the same length, so the median lies between the last element of smallerList and the first element of biggerList.
median = (6 + 9) / 2 = 7.5
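The walkthrough above can be sketched in Python with two heaps standing in for the two sorted lists. This is an illustration under assumptions, not the exact procedure from the hint: the class name `RunningMedian` is made up here, and a new value is compared with the largest element of the smaller half rather than with the current median, which leads to the same split.

```python
import heapq


class RunningMedian:
    """Keep the smaller half in a max-heap and the bigger half in a min-heap."""

    def __init__(self):
        self._smaller = []  # max-heap, stored as negated values
        self._bigger = []   # min-heap

    def add(self, value):
        # Route the value to the half it belongs to.
        if self._smaller and value > -self._smaller[0]:
            heapq.heappush(self._bigger, value)
        else:
            heapq.heappush(self._smaller, -value)
        # Rebalance when one half is more than one element longer.
        if len(self._smaller) > len(self._bigger) + 1:
            heapq.heappush(self._bigger, -heapq.heappop(self._smaller))
        elif len(self._bigger) > len(self._smaller) + 1:
            heapq.heappush(self._smaller, -heapq.heappop(self._bigger))

    def median(self):
        if not self._smaller and not self._bigger:
            raise ValueError("no values added yet")
        if len(self._smaller) > len(self._bigger):
            return -self._smaller[0]
        if len(self._bigger) > len(self._smaller):
            return self._bigger[0]
        return (-self._smaller[0] + self._bigger[0]) / 2


rm = RunningMedian()
for number in [9, 10, 4, 6, 13, 12]:
    rm.add(number)
median_after_six = rm.median()    # matches step 3
rm.add(5)
median_after_seven = rm.median()  # matches step 6
rm.add(2)
median_after_eight = rm.median()  # matches step 10
```

Each insertion costs O(log n) time, but both heaps together still hold every number seen, so the memory use is O(n).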

I hope this is clear. I think this is what your hint meant (@WillemVanOnsem).

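Yes, this probably answers my first question... but the problem with this solution is that both lists (smallerList and biggerList) can grow to a considerable size. Suppose we have a stream of 10^18 numbers and want to find the median of all of them without running out of memory. How can this be solved in a space-efficient way?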

1 answer:

Answer 0: (score: 1)

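This cannot be done without remembering every number you have seen, because at any point, any number you have seen in the past could still become the median in the future.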

If you have seen n numbers so far, then for any i, the i-th smallest of them could become the median:

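  • If i > n/2, this happens when the next 2i - n numbers are all larger: the i-th smallest is then the lower of the two middle values (after only 2i - n - 1 larger numbers it is the single middle value).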

  • If i <= n/2, this happens when the next n - 2i + 1 numbers are all smaller.
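The two cases can be checked with a small Python sketch. The helper `force_median` is hypothetical and written only for this illustration. For the i > n/2 case it appends 2i - n - 1 larger values, so that the total count is odd and the target is the single middle element that `statistics.median` returns.

```python
import statistics


def force_median(seen, i):
    """Return (target, extra): appending `extra` to `seen` makes the
    i-th smallest element of `seen` (1-based) the median."""
    n = len(seen)
    ordered = sorted(seen)
    target = ordered[i - 1]
    if 2 * i > n:
        # Case i > n/2: append values larger than everything seen so far.
        extra = [ordered[-1] + 1] * (2 * i - n - 1)
    else:
        # Case i <= n/2: append values smaller than everything seen so far.
        extra = [ordered[0] - 1] * (n - 2 * i + 1)
    return target, extra


seen = [8, 10, 4, 6]
results = []
for i in range(1, len(seen) + 1):
    target, extra = force_median(seen, i)
    results.append((target, statistics.median(seen + extra)))
```

Every element of `seen` ends up as the median for some continuation of the stream, which is why none of them can be discarded if the exact median is required.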