Question

如何对大型数据集实际使用合并排序？

假设我有几个带有以下数据的已排序文件：

的1.txt

1
2
2

2.txt

3
4
5

3.txt

1
1
1

假设我们无法同时将所有文件的内容保存在内存中（假设我们只能从每个文件中保存两个数字）。

我听说在这种情况下我可以使用某种R-way合并排序，但我不明白我该怎么做呢。

如您所见，第一次迭代将为我们提供以下排序序列：

1 1 1 2 3 4

，所以我们将它刷新到输出文件。但是，我们将在下一次迭代中再次获取1（来自3.txt文件），因此整个结果序列是错误的！

Answer 1

首先填充与文件一样多的变量，一个变量附加到一个文件。在每个步骤中找到三个变量的最低值，并在从同一文件中再次填充它时将其刷新到输出。

| 1.txt | 2.txt | 3.txt |
| 1     | 3     | 1     | output 1 refill from file 1
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | 1     | output 1 refill from file 3
| 2     | 3     | nil   | output 2 refill from file 1
| 2     | 3     | nil   | output 2 refill from file 1
| nil   | 3     | nil   | output 3 refill from file 2
| nil   | 4     | nil   | output 4 refill from file 2
| nil   | 5     | nil   | output 5 refill from file 2
| nil   | nil   | nil   | end

Answer 2

I heard that I can use some kind of R-way merge sort in this case but I don't understand how can I actually do it.

N路合并很容易解释。打开所有文件，从每个文件中获取第一个元素并将它们放入堆中。然后算法从堆中获取最小元素（pop），将其写入输出缓冲区，然后从该项源自的文件中读取下一个元素。重复，直到所有文件都为空。

如何对大型数据集实际使用合并排序

2 个答案: