Question

Here is the problem: my computer has only 1 GB of RAM, and I have a text file containing 10 GB of data. The file contains numbers. How can I sort them?

Adding some more details:

- They are all integers, such as 10000, 16723998, etc.
- The same integer value can appear repeatedly in the file.

Answer 1

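Split the file into parts (buffers) that are each small enough to be sorted in place.

Then, once all the buffers are sorted, take 2 (or more) of them at a time and merge them (as in merge sort) until only 1 buffer remains; that buffer is the sorted file.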
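The "take 2 at a time and merge" step can be sketched as follows. This is only an illustration under stated assumptions: the names `merge_two` and `merge_all` are made up for this example, and in-memory lists stand in for the sorted buffers, which in the real problem would be files on disk that are streamed rather than loaded whole.

```python
from typing import Iterable, Iterator, List


def merge_two(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Merge two already-sorted streams into one sorted stream."""
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x <= y:
            yield x
            x = next(it_a, None)
        else:
            yield y
            y = next(it_b, None)
    # One side is exhausted; drain whatever is left on the other.
    while x is not None:
        yield x
        x = next(it_a, None)
    while y is not None:
        yield y
        y = next(it_b, None)


def merge_all(runs: List[List[int]]) -> List[int]:
    """Merge sorted runs two at a time until a single run remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs) - 1, 2):
            merged.append(list(merge_two(runs[i], runs[i + 1])))
        if len(runs) % 2:
            # An odd run out is carried over to the next round unchanged.
            merged.append(runs[-1])
        runs = merged
    return runs[0] if runs else []
```

Merging pairwise takes several rounds over the data; merging more buffers per round trades fewer rounds for more simultaneous input streams.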

Answer 2

How about external sorting as described by Knuth? See 4.1, Wikipedia, or TAOCP, Sorting and Searching.

Answer 3

See this link. The author explains it beautifully.

An example of disk-based application: External mergesort algorithm (wikipedia)
A merge sort divides the unsorted list into n sublists, each containing 1 element, and then repeatedly merges sublists to produce new sorted sublists until there is only 1 sublist remaining.
The external mergesort algorithm sorts chunks that each fit in RAM, then merges the sorted chunks together. For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
1. Read 100 MB of the data in main memory and sort by some conventional sorting method, like quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 100 MB chunks (there are 900MB / 100MB = 9 chunks), which now need to be merged into one single output file.
4. Read the first 10 MB of each sorted chunk (of 100 MB) into input buffers in main memory and allocate the remaining 10 MB for an output buffer. (In practice, it might provide better performance to make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 9-way merge and store the result in the output buffer. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 9 input buffers empties, fill it with the next 10 MB of its associated 100 MB sorted chunk until no more data from the chunk is available. This is the key step that makes external merge sort work externally -- because the merge algorithm only makes one pass sequentially through each of the chunks, each chunk does not have to be loaded completely; rather, sequential parts of the chunk can be loaded as needed.
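The five steps above could look roughly like this in Python. It is a minimal sketch, not a tuned implementation, and it makes several assumptions that are not in the quoted text: the input holds one integer per line, `external_sort` and `max_items` are names invented here, the chunk size is counted in lines rather than megabytes, and the explicit input and output buffers are left to Python's own buffered file objects together with `heapq.merge`.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(src_path, dst_path, max_items):
    """Sort a text file of one integer per line, holding at most
    max_items values in memory at a time."""
    run_paths = []
    # Steps 1-3: read a chunk, sort it in memory, write it out as a sorted run.
    with open(src_path) as src:
        while True:
            lines = list(islice(src, max_items))
            if not lines:
                break
            chunk = sorted(int(line) for line in lines if line.strip())
            if not chunk:
                continue
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as run:
                run.writelines(f"{n}\n" for n in chunk)
            run_paths.append(path)

    # Steps 4-5: k-way merge. Each run is read sequentially through its
    # file object's buffer, so no run is ever loaded in full.
    runs = [open(p) for p in run_paths]
    try:
        with open(dst_path, "w") as dst:
            streams = [(int(line) for line in f) for f in runs]
            dst.writelines(f"{n}\n" for n in heapq.merge(*streams))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

A call such as `external_sort("numbers.txt", "sorted.txt", max_items=50_000_000)` would be the shape of the usage; both file names and the chunk size are placeholders to be chosen for the machine at hand.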

Answer 4

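Split the 10 GB of data into 10 × 1 GB buffers. Pass over all 10 GB once with a heap (min or max); that leaves 1 GB of sorted data in the min_heap and 9 GB of unsorted data. Then repeat the same operation on the remaining 9 GB until everything is sorted.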
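One way to read this approach is as a multi-pass selection: each pass scans the whole input and keeps only the smallest values that have not been output yet. The sketch below is an interpretation, not the answerer's code. It uses `heapq.nsmallest`, which maintains a bounded heap internally. The names `multipass_sort` and `read_values` are hypothetical, and pairing each value with its position (so that duplicates are neither lost nor emitted twice) is an assumption added here, at the cost of extra memory per item.

```python
import heapq


def multipass_sort(read_values, k):
    """Yield every value in sorted order while holding at most k items
    in memory per pass.

    read_values is a callable returning a fresh iterator over the input,
    so that the data can be re-read on every pass.
    """
    last = None  # largest (value, position) pair emitted so far
    while True:
        candidates = (
            (v, i)
            for i, v in enumerate(read_values())
            if last is None or (v, i) > last
        )
        # nsmallest keeps a heap of at most k items and returns them sorted.
        batch = heapq.nsmallest(k, candidates)
        if not batch:
            return
        for v, _ in batch:
            yield v
        last = batch[-1]
```

Note that this reads the full input once per pass, so sorting n items with room for k takes about n / k passes, which generally means more disk reads than a chunk-and-merge approach.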

Answer 5

To sort 10 GB of data using only 1 GB of RAM:

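1. Read 1 GB of the data into main memory and sort it using quicksort.
2. Write the sorted data to disk.
3. Repeat steps 1 and 2 until all of the data is in sorted 1 GB chunks (there are 10 GB / 1 GB = 10 chunks), which now need to be merged into a single output file.
4. Read the first 90 MB of each sorted chunk (of 1 GB) into input buffers in main memory and allocate the remaining 100 MB to an output buffer. (For better performance, we can make the output buffer larger and the input buffers slightly smaller.)
5. Perform a 10-way merge and store the result in the output buffer.
6. Whenever the output buffer fills, write it to the final sorted file and empty it. Whenever any of the 90 MB input buffers empties, refill it with the next 90 MB of its associated 1 GB sorted chunk, until no more data from that chunk is available.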

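This is the external merge sort method: the merge runs against chunks stored on disk, so no chunk ever has to be loaded into memory in full.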
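The buffer sizes in this answer follow from simple arithmetic, which can be checked with a few lines of Python. This is an illustrative sketch: `merge_buffers` and `input_percent` are hypothetical names, 1 GB is treated as 1000 MB as the answer's own numbers imply, and the 90% share given to input buffers is an assumption chosen to reproduce those numbers.

```python
def merge_buffers(ram_mb, data_mb, input_percent=90):
    """Return (number of sorted chunks, MB per input buffer, MB for output)."""
    # One chunk per RAM-sized slice of the data, rounded up.
    n_chunks = -(-data_mb // ram_mb)
    # Split the input share of RAM evenly across the chunks.
    input_mb = (ram_mb * input_percent // 100) // n_chunks
    # Whatever is left over goes to the output buffer.
    output_mb = ram_mb - n_chunks * input_mb
    return n_chunks, input_mb, output_mb


print(merge_buffers(1000, 10000))  # (10, 90, 100)
print(merge_buffers(100, 900))     # (9, 10, 10)
```

The second call reproduces the 900 MB in 100 MB example: 9 chunks, 10 MB per input buffer, and 10 MB for output.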

Answer 6

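We use merge sort: first divide the data, then merge it.

1. Divide the data into 10 groups of 1 GB each.
2. Sort each group and write it to disk.
3. Load 10 items from each group into main memory.
4. Output the smallest item in main memory to disk. Load the next item from the group that the chosen item came from.
5. Repeat step 4 until all items have been output.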
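Steps 3 to 5 amount to a k-way merge driven by a min-heap, which might be sketched like this. The function name `k_way_merge` is made up for the example, lists stand in for the sorted groups on disk, and the sketch keeps only one pending item per group in the heap (rather than 10), leaving any read-ahead to whatever iterator supplies the group.

```python
import heapq
from typing import Iterable, Iterator, List


def k_way_merge(groups: List[Iterable[int]]) -> Iterator[int]:
    """Merge already-sorted groups by repeatedly emitting the smallest
    pending item and refilling from the group it came from."""
    iters = [iter(g) for g in groups]
    heap = []
    # Load the first item of every non-empty group.
    for idx, it in enumerate(iters):
        first = next(it, None)
        if first is not None:
            heap.append((first, idx))
    heapq.heapify(heap)

    while heap:
        value, idx = heapq.heappop(heap)
        yield value
        # Refill from the same group that supplied the emitted item.
        nxt = next(iters[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
```

With 10 groups the heap never holds more than 10 entries, so each output item costs only a handful of comparisons regardless of how large the groups are.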
