
Asked: 2013-01-23 18:50:56

Tags: objective-c algorithm performance


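Suppose we have a very large data set and want a random sample of its items for testing a new tool. Rather than worry about the specifics of accessing things, let's assume the system provides the following: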

// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);

// Interface to implementations other people write.
@interface Dataset : NSObject

// YES when there is no more data.
- (BOOL)endOfData;

// Get the next element and move forward.
- (NSString*)getNext;

@end

// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
  // Describe how this works.
}

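Edit: So you are supposed to pick items at random from the given data set. If k = 5, I want to randomly select 5 elements from the data set and return an array of those items. Every element in the data set must have an equal chance of being selected.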

2 Answers:

Answer 0 (score: 0)


1. use input parameter k to dynamically allocate an array of numbers
    unsigned * numsArray = (unsigned *)malloc(sizeof(unsigned) * k);

2. run a loop that gets k random numbers and stores them into the numsArray (must be careful here to check each new random to see if we have gotten it before, and if we have, get another random, etc...)

3. sort numsArray

4. run a loop beginning at the beginning of DataSet with your own incrementing counter dataCount and another counter numsCount both beginning at 0.  whenever dataCount is equal to numsArray[numsCount], grab the current data object and add it to your newly created random list then increment numsCount.

5. The loop in step 4 can end when either numsCount reaches k or dataCount reaches the end of the dataset.

6. The only other step that may be needed comes before all of the above: walk the dataset once with getNext to count how many elements it holds, so that the random numbers can be bounded by that count and k can be checked to be less than or equal to it.
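
As an illustration of steps 1 through 5 above, here is a minimal sketch in plain C (which also compiles as Objective-C). It is not the poster's code: the helper names pick_sorted_indices, select_by_index and cmp_unsigned are made up for this example, a C array stands in for the Dataset's endOfData/getNext pair, and Rand is a placeholder for the system-provided function. The element count n is assumed to be known already, as step 6 requires.

```c
#include <assert.h>
#include <stdlib.h>

/* Placeholder for the Rand(n) the question says the system provides.
   rand() % n has a slight modulo bias; it is only a stand-in. */
static int Rand(int n) { return rand() % n; }

static int cmp_unsigned(const void *a, const void *b) {
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Steps 1-3: fill idx[0..k-1] with k distinct indices in [0, n), sorted. */
static void pick_sorted_indices(unsigned *idx, unsigned k, unsigned n) {
    assert(k <= n);                  /* step 6: k must not exceed the data size */
    unsigned got = 0;
    while (got < k) {
        unsigned r = (unsigned)Rand((int)n);
        unsigned j = 0;
        while (j < got && idx[j] != r) j++;   /* have we drawn r before? */
        if (j == got) idx[got++] = r;         /* new value: keep it */
    }
    qsort(idx, k, sizeof idx[0], cmp_unsigned);
}

/* Steps 4-5: one forward pass over the data, copying out the elements
   whose position matches the next wanted index. Returns how many were
   copied into out[]. */
static unsigned select_by_index(const char *const *data, unsigned n,
                                const unsigned *idx, unsigned k,
                                const char **out) {
    unsigned numsCount = 0;
    for (unsigned dataCount = 0; dataCount < n && numsCount < k; dataCount++) {
        if (dataCount == idx[numsCount]) out[numsCount++] = data[dataCount];
    }
    return numsCount;
}
```

The duplicate check is a linear scan, which is fine while k is small; for a large k a hash set or a partial shuffle would be the usual replacement.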


// one must assume that once we get to the end, we can start over within the set again
1. run a while loop that checks for endOfData
    a. count up a count variable that is initialized to 0

2. run a loop from 0 through k-1
    a. generate a random number that you constrain to the list size
    b. run a loop that moves through the dataset until it hits the rand element
    c. compare that element with all other elements in your new list to make sure it isn't already in your new list (if it is, go back to step a)
    d. store the element into your new list



1. run a loop from 0 through k-1
    a. generate a random number
    b. use the generated random as a skip count, move skip count objects through the list
    c. store the current item from the list into your new list



Answer 1 (score: 0)


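Instead, pick five values in the range 0 to n-1. In the unlikely event that the five indices contain a duplicate, replace the duplicate with another random value. Then use the five indices to do random-access lookups of the i-th element in the population.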

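This is simple. It uses the minimum number of calls to the random number generator, and it only accesses memory for the elements that are actually selected.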
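
To put a rough number on how unlikely a duplicate is, here is a back-of-the-envelope estimate. It assumes the five indices are drawn independently and uniformly from 0 to n-1, which is an assumption about Rand rather than something stated in the answer, and the approximation only holds when n is much larger than 10:

```latex
P(\text{all five distinct}) = \prod_{i=0}^{4}\left(1 - \frac{i}{n}\right) \approx 1 - \frac{10}{n}
\qquad\Rightarrow\qquad
P(\text{at least one duplicate}) \approx \frac{10}{n}
```

For n = 1,000,000 that is roughly one run in 100,000, so the re-draw step should rarely cost anything.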


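If you are not allowed to iterate over the data more than once, use a chunked form of reservoir sampling: 1) Select the first five elements as the initial sample, each with probability 1/5. 2) Read in a large chunk of data and select five new samples from that chunk (using only five calls to Rand). 3) Pairwise, decide whether to keep the new sample item or the old sample item (with odds proportional to the sizes of the populations the two samples were drawn from). 4) Repeat until all the data has been read.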


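  • Select the first five as the initial sample: current_sample = read(5); population = 5.
  • Read a chunk of n data points (perhaps n = 200 in this example):
    • subpop = read(200);
    • m = len(subpop);
    • new_sample = choose(5, subpop);
    • Loop over the two samples pairwise:
      • for (a, b) in (current_sample and new_sample): if random(0 to population + m) < population, then keep a, otherwise keep b
    • population += m
    • repeat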
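
The outline above can be sketched in plain C (which also compiles as Objective-C) roughly as follows. This is one reading of the answer, not its author's code: the Stream type and the functions read_chunk, choose_front and chunked_sample are hypothetical names, an int array stands in for the Dataset, and Rand is a placeholder for the system-provided function. The outline does not say what to do when the final chunk holds fewer than k items; this sketch falls back to ordinary one-item-at-a-time reservoir sampling there, which is an addition of mine.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder for the system-provided Rand(n); rand() % n is slightly biased. */
static int Rand(int n) { return rand() % n; }

/* Hypothetical stand-in for Dataset: an array that is read front to back. */
typedef struct { const int *items; size_t len, pos; } Stream;

/* Read up to max items into buf; returns how many were actually read. */
static size_t read_chunk(Stream *s, int *buf, size_t max) {
    size_t m = s->len - s->pos;
    if (m > max) m = max;
    memcpy(buf, s->items + s->pos, m * sizeof *buf);
    s->pos += m;
    return m;
}

/* Move k uniformly chosen items to the front of buf (partial Fisher-Yates
   shuffle). Uses exactly k calls to Rand. */
static void choose_front(int *buf, size_t m, size_t k) {
    assert(k <= m);
    for (size_t i = 0; i < k; i++) {
        size_t j = i + (size_t)Rand((int)(m - i));
        int t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
}

/* Fills sample[0..k-1] in a single pass and returns the number of items
   seen. If fewer than k items exist, only that many slots are filled.
   buf must have room for chunk items. */
static size_t chunked_sample(Stream *s, int *sample, size_t k,
                             int *buf, size_t chunk) {
    assert(k > 0 && chunk >= k);
    size_t population = read_chunk(s, sample, k);    /* initial sample */
    size_t m;
    while ((m = read_chunk(s, buf, chunk)) > 0) {
        if (m >= k) {
            choose_front(buf, m, k);                 /* new_sample = buf[0..k-1] */
            for (size_t i = 0; i < k; i++) {
                /* keep the old item with probability population/(population+m) */
                if ((size_t)Rand((int)(population + m)) >= population)
                    sample[i] = buf[i];
            }
            population += m;
        } else {
            /* Short final chunk: per-item reservoir sampling (my addition). */
            for (size_t i = 0; i < m; i++) {
                size_t j = (size_t)Rand((int)(++population));
                if (j < k) sample[j] = buf[i];
            }
        }
    }
    return population;
}
```

With the pairwise rule, an item from the old population survives with probability (k/population) * population/(population+m) and an item from the new chunk is taken with probability (k/m) * m/(population+m); both come to k/(population+m), which is why each element ends up equally likely to be in the sample.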