External merge sort efficiency

Asked: 2014-08-15 16:11:34

Tags: java performance algorithm sorting mergesort

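I have implemented an external mergesort to sort a file of Java int primitives, but it is horribly slow (fortunately it does at least work). Very little happens in the sort method: it just calls merge recursively, doubling blockSize on each call and swapping the input and output files each time. Could I really be losing that much time here?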

//Merge stage of external mergesort
//Read from input file, already sorted into blocks of size blockSize
//Write to output file, sorted into blocks of 2*blockSize
public static void merge(String inputFile, String outputFile, long blockSize)
    throws IOException
{
  //readers for block1/2
  FileInputStream fis1 = new FileInputStream(inputFile);
  DataInputStream dis1 = new DataInputStream(fis1);
  FileInputStream fis2 = new FileInputStream(inputFile);
  DataInputStream dis2 = new DataInputStream(fis2);

  //writer to output file
  FileOutputStream fos = new FileOutputStream(outputFile);
  DataOutputStream dos = new DataOutputStream(fos);

  // merging 2 sub lists
  // go along pairs of blocks in inputFile
  // continue until end of input

  //initialise block2 at right position
  dis2.skipBytes((int) blockSize);

  //while we haven't reached the end of the file
  while (dis1.available() > 0)
    {
      // if block1 is last block, copy block1 to output
      if (dis2.available() <= 0)
        {
          while (dis1.available() > 0) 
            dos.writeInt(dis1.readInt());
          break;
        }
      // if block1 not last block, merge block1 and block2
      else
        {
          long block1Pos = 0;
          long block2Pos = 0;
          boolean block1Over = false;
          boolean block2Over = false;

          //data read from each block
          int e1 = dis1.readInt();
          int e2 = dis2.readInt();

          //keep going until fully examined both blocks
          while (!block1Over | !block2Over)
            {
              //copy from block 1 if:
              //  block1 hasnt been fully examined AND
              //  block1 element less than block2s OR block2 has been fully examined
              while ( !block1Over & ((e1 <= e2) | block2Over) )
                {
                  dos.writeInt(e1); block1Pos += 4;
                  if (block1Pos < blockSize & dis1.available() > 0) 
                    e1 = dis1.readInt();
                  else 
                    block1Over = true;
                }
              //same for block2
              while ( !block2Over & ((e2 < e1) | block1Over) )
                {
                  dos.writeInt(e2); block2Pos += 4;
                  if (block2Pos < blockSize & dis2.available() > 0) 
                    e2 = dis2.readInt();
                  else 
                    block2Over = true;
                }
            }
        }
      // skip to next blocks
      dis1.skipBytes((int) blockSize);
      dis2.skipBytes((int) blockSize);
    }
  dis1.close();
  dis2.close();
  dos.close();
  fos.close();
}

1 answer:

Answer 0 (score: 0)

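  1. No buffering. Wrap every FileInputStream in a BufferedInputStream and every FileOutputStream in a BufferedOutputStream. Without them, each readInt() and writeInt() goes straight to the underlying file.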

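  2. Misuse of available(). It is not a valid test for end of stream, and every call to it is an extra system call. Just wait for the stream itself to signal the real end: readInt() throws EOFException when there is no more data.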

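  3. Non-optimal initial distribution. The fact that merge() receives a single block size suggests you are not using replacement selection for the initial distribution, so your initial runs are at best about half as long as they could be. That has an exponential effect on the number of merge passes required.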

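  4. Unbalanced merging. You need to add dummy runs at the start of the merge phase so that your last merge is N-way rather than, in the worst case, two-way. That saves an extra pass over almost the entire data set. It follows that you need to know the number of initial runs before you start merging.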
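
To make points 1 and 2 concrete, here is a minimal sketch of a two-way merge that buffers every stream and detects the end of a run from the EOFException thrown by readInt(), not from available(). It is an illustration under assumptions, not a drop-in replacement for the merge() in the question: it assumes each sorted run lives in its own file, and the names BufferedMergeSketch, RunReader and mergeTwoRuns are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

// Hypothetical sketch, not the asker's code: merges two sorted int files.
// Assumption: each run is stored in a separate file.
final class BufferedMergeSketch {

    // Cursor over one sorted run; 'current' holds the next unwritten value.
    static final class RunReader implements Closeable {
        private final DataInputStream in;
        int current;
        boolean exhausted;

        RunReader(File file) throws IOException {
            // Buffered, so readInt() does not hit the file on every call.
            in = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
            advance();
        }

        // Loads the next int; EOFException is the real end-of-stream signal.
        void advance() throws IOException {
            try {
                current = in.readInt();
            } catch (EOFException endOfRun) {
                exhausted = true;
            }
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    // Writes the merged contents of runA and runB to output, in ascending order.
    static void mergeTwoRuns(File runA, File runB, File output) throws IOException {
        try (RunReader a = new RunReader(runA);
             RunReader b = new RunReader(runB);
             DataOutputStream out = new DataOutputStream(
                     new BufferedOutputStream(new FileOutputStream(output)))) {
            while (!a.exhausted || !b.exhausted) {
                // Take from A when B is done, or when A's value is not larger (ties go to A).
                boolean takeA = b.exhausted || (!a.exhausted && a.current <= b.current);
                RunReader source = takeA ? a : b;
                out.writeInt(source.current);
                source.advance();
            }
        }
    }
}
```

Calling BufferedMergeSketch.mergeTwoRuns(fileA, fileB, merged) on two sorted files produces one sorted file. The same cursor idea carries over to the N-way merge in point 4, typically by keeping the cursors in a priority queue ordered by their current value.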