Question

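I've implemented an external mergesort to sort a file of Java int primitives, but it is horribly slow (fortunately it does at least work). Very little happens in the sort method itself; it just repeatedly calls merge, doubling blockSize on each call and swapping the input and output files each time. Where could I be losing so much time?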

//Merge stage of external mergesort
//Read from input file, already sorted into blocks of size blockSize
//Write to output file, sorted into blocks of 2*blockSize
public static void merge(String inputFile, String outputFile, long blockSize)
    throws IOException
{
  //readers for block1/2
  FileInputStream fis1 = new FileInputStream(inputFile);
  DataInputStream dis1 = new DataInputStream(fis1);
  FileInputStream fis2 = new FileInputStream(inputFile);
  DataInputStream dis2 = new DataInputStream(fis2);

  //writer to output file
  FileOutputStream fos = new FileOutputStream(outputFile);
  DataOutputStream dos = new DataOutputStream(fos);

  // merging 2 sub lists
  // go along pairs of blocks in inputFile
  // continue until end of input

  //initialise block2 at right position
  dis2.skipBytes((int) blockSize);

  //while we haven't reached the end of the file
  while (dis1.available() > 0)
    {
      // if block1 is last block, copy block1 to output
      if (dis2.available() <= 0)
        {
          while (dis1.available() > 0) 
            dos.writeInt(dis1.readInt());
          break;
        }
      // if block1 not last block, merge block1 and block2
      else
        {
          long block1Pos = 0;
          long block2Pos = 0;
          boolean block1Over = false;
          boolean block2Over = false;

          //data read from each block
          int e1 = dis1.readInt();
          int e2 = dis2.readInt();

          //keep going until fully examined both blocks
          while (!block1Over | !block2Over)
            {
              //copy from block 1 if:
              //  block1 hasnt been fully examined AND
              //  block1 element less than block2s OR block2 has been fully examined
              while ( !block1Over & ((e1 <= e2) | block2Over) )
                {
                  dos.writeInt(e1); block1Pos += 4;
                  if (block1Pos < blockSize & dis1.available() > 0) 
                    e1 = dis1.readInt();
                  else 
                    block1Over = true;
                }
              //same for block2
              while ( !block2Over & ((e2 < e1) | block1Over) )
                {
                  dos.writeInt(e2); block2Pos += 4;
                  if (block2Pos < blockSize & dis2.available() > 0) 
                    e2 = dis2.readInt();
                  else 
                    block2Over = true;
                }
            }
        }
      // skip to next blocks
      dis1.skipBytes((int) blockSize);
      dis2.skipBytes((int) blockSize);
    }
  dis1.close();
  dis2.close();
  dos.close();
  fos.close();
}

Answer 1

No buffering. Add BufferedInputStreams and BufferedOutputStreams everywhere, between each DataInputStream/DataOutputStream and the underlying file stream.
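Misuse of available(). It is not a valid test for end of stream, and every call to it is an extra system call. Just wait for the real end-of-stream indication from the stream itself (for readInt(), an EOFException).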
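As a minimal sketch of the first two points, the class below wraps the file streams in buffers and detects end of input by catching the EOFException that readInt() throws, instead of polling available(). The class name BufferedIntCopy and the helper copyInts are hypothetical, made up for illustration; it only copies ints, so the same pattern would still have to be worked into the merge loop.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedIntCopy {
    // Hypothetical helper: copies every int from inputFile to outputFile
    // through buffered streams and returns how many ints were copied.
    public static long copyInts(String inputFile, String outputFile) throws IOException {
        long count = 0;
        // try-with-resources closes (and flushes) both stream chains.
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(inputFile)));
             DataOutputStream out = new DataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(outputFile)))) {
            while (true) {
                int value;
                try {
                    value = in.readInt();
                } catch (EOFException endOfStream) {
                    break; // the real end of input; no available() call needed
                }
                out.writeInt(value);
                count++;
            }
        }
        return count;
    }
}
```

With the buffers in place, most readInt() and writeInt() calls are served from memory rather than each one going to the operating system.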
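Non-optimal initial distribution. The fact that you are working with a single fixed block size suggests you are not using a replacement-selection distribution, so your initial runs are probably at most half as long as they could be. Because the number of merge passes grows with the logarithm of the number of initial runs, fewer and longer runs mean fewer passes.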
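To illustrate replacement selection, here is a small in-memory sketch. It assumes the input is an int[] standing in for the file and that capacity (at least 1) is the number of ints that fit in memory; the class ReplacementSelection and the method makeRuns are hypothetical names, and a real version would stream from disk and write each run out as it is produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ReplacementSelection {
    // Splits input into sorted runs while holding at most `capacity` values at once.
    public static List<List<Integer>> makeRuns(int[] input, int capacity) {
        List<List<Integer>> runs = new ArrayList<>();
        PriorityQueue<Integer> current = new PriorityQueue<>(); // values for the current run
        PriorityQueue<Integer> next = new PriorityQueue<>();    // values deferred to the next run
        int pos = 0;

        // Fill the available memory with the first values.
        while (pos < input.length && current.size() < capacity) {
            current.add(input[pos++]);
        }

        List<Integer> run = new ArrayList<>();
        while (!current.isEmpty()) {
            int smallest = current.poll();
            run.add(smallest);
            if (pos < input.length) {
                int incoming = input[pos++];
                if (incoming >= smallest) {
                    current.add(incoming); // can still extend the current run
                } else {
                    next.add(incoming);    // too small, must wait for the next run
                }
            }
            if (current.isEmpty()) {
                // Current run is finished; start the next one from the deferred values.
                runs.add(run);
                run = new ArrayList<>();
                PriorityQueue<Integer> tmp = current;
                current = next;
                next = tmp;
            }
        }
        return runs;
    }
}
```

For example, makeRuns(new int[] {5, 1, 8, 2, 9, 3, 7}, 3) yields the two runs [1, 2, 5, 8, 9] and [3, 7], whereas sorting fixed blocks of three values would produce three runs.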
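Unbalanced merging. You need to add dummy runs at the start of the merge phase so that your last merge is an N-way merge rather than, in the worst case, a two-way one. This can save an additional pass over almost the entire data. It also means you need to know the number of initial runs before you start merging.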
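One common way to work out how many dummy runs to add is sketched below, assuming every merge step combines exactly fanIn runs into one. Each such step reduces the run count by fanIn - 1, so the total (real plus dummy) must satisfy (runs - 1) % (fanIn - 1) == 0 for the final merge to be a full one. The names DummyRuns, dummyRunsNeeded, initialRuns and fanIn are hypothetical, and this is not necessarily the exact scheme the answer has in mind.

```java
public class DummyRuns {
    // Returns how many empty (dummy) runs to add so that repeated
    // fanIn-way merges end with one full fanIn-way merge.
    public static int dummyRunsNeeded(int initialRuns, int fanIn) {
        if (initialRuns < 1 || fanIn < 2) {
            throw new IllegalArgumentException("need initialRuns >= 1 and fanIn >= 2");
        }
        // Every merge removes (fanIn - 1) runs, and we must end with exactly one.
        int remainder = (initialRuns - 1) % (fanIn - 1);
        return remainder == 0 ? 0 : (fanIn - 1) - remainder;
    }
}
```

For instance, dummyRunsNeeded(11, 4) returns 2: with 13 runs in total, four 4-way merges (13 to 10 to 7 to 4 to 1) finish the sort with every merge full.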

External merge sort efficiency
