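Optimizing the Burrows-Wheeler transform

Time: 2011-05-14 07:04:33

Tags: java text transform burrows-wheeler-transform

Hello, I'm having some difficulty optimizing a Burrows-Wheeler transform. I'm trying to transform text files, but large ones such as the Bible take far too long.

Any ideas on how to proceed?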

public BurrowsWheelerTransformEncoder()
{

}

private String originalSuffix(int index, String string)
{
    String temp = (string.substring(index,string.length()) + string.substring(0,index));

    //this bit just 'compresses' each transformation of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so minimal amount of memory is used when it is stored in an array

    return temp.substring(0,5)+
    //the last character of the transformation is kept
           temp.charAt(temp.length()-1);
}

private String compressedSuffix(String string)
{
    //this method just 'compresses' original piece of text by producing
    //a prefix, so 'abracadabra' just becomes 'abrac'
    //this is so comparisons won't take so long
    return string.substring(0,5)+string.charAt(string.length()-1);
}

public static void main(String args[]) throws Exception
{
    BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder();
    BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt"));

    String text = "";
    //the row in the sorted array where the original text can be found
    int originalRow = 0;
    //system time when program began
    long startTime = System.nanoTime();

    //get text from file
    while(input.ready())
    {
        text += input.readLine();
    }
    //create a new array to hold all transformations
    String[] textArray = new String[text.length()];
    int length = text.length();

    //get individual transformations and put in array
    for(int i = 0; i < text.length(); i++)
    {
        textArray[i] = encoder.originalSuffix(i,text);
        //for debugging large text files, prints progress after every 10k'th 
        //transformation
        if(i%10000==0)
        System.out.println(i+"/"+length);
    }
    //uses java's internal methods to sort the array, presumably 
    //the most efficient way to do the sort (for now)
    Arrays.sort(textArray);

    String compressedOriginalText = encoder.compressedSuffix(text);

    //print the results
    for(int i = 0; i < textArray.length; i++)
    {
        if(textArray[i].equals(compressedOriginalText))
        {
            originalRow = i;
        }
        if(i%100==0)
        {
            System.out.println();
        }
        System.out.print(textArray[i].charAt(textArray[i].length()-1));
    }
    System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array.");
    System.out.println("Time elapsed: " + (System.nanoTime() - startTime));
 }

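2 answers:

Answer 0 (score: 3)

For encoding, you don't need to actually build the array of strings. Use an array of int (or long, depending on your file size) to store the index at which each rotated string starts.

  • Create an array initialized to [0 1 2 3 ... n-1], where n is the length of the text
  • Sort the array using the following compareTo (assuming compareTo() has access to the original string, original):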

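    // Compares the rotations of original that start at indices a and b.
    int compareTo(int a, int b){
        int len = original.length();
        // Stop after len characters so that identical rotations (a == b, or
        // periodic text such as "abab") compare as equal instead of looping forever.
        for(int i = 0; i < len; i++){
            int compare = original.charAt(a) - original.charAt(b);
            if(compare != 0) return compare;
            a++; b++;
            if(a>=len)a-=len;
            if(b>=len)b-=len;
        }
        return 0;
    }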
    
  • Note the position at which index "0" ends up in the sorted array and add it to the output as the "start" value
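
The encoding steps above can be put together roughly as follows. This is only a sketch under a few assumptions: the class name `BwtEncodeSketch`, the `Encoded` holder and the method names are made up for illustration, the whole text is assumed to fit in one `String`, and the comparison is capped at the text length so that identical rotations compare as equal.

```java
import java.util.Arrays;

// Hypothetical sketch: sorts rotation start indices instead of rotated strings.
final class BwtEncodeSketch {

    // Holder for the result: the last column plus the row of the original text.
    static final class Encoded {
        final String lastColumn;
        final int start;

        Encoded(String lastColumn, int start) {
            this.lastColumn = lastColumn;
            this.start = start;
        }
    }

    // Compares the rotations of original that begin at a and b, character by character.
    static int compareRotations(String original, int a, int b) {
        int len = original.length();
        for (int i = 0; i < len; i++) {
            int compare = original.charAt(a) - original.charAt(b);
            if (compare != 0) {
                return compare;
            }
            if (++a == len) a = 0;
            if (++b == len) b = 0;
        }
        return 0; // the two rotations are identical
    }

    static Encoded encode(String original) {
        int n = original.length();
        // Boxed indices are used only so that Arrays.sort accepts a comparator.
        Integer[] rotations = new Integer[n];
        for (int i = 0; i < n; i++) {
            rotations[i] = i;
        }
        Arrays.sort(rotations, (a, b) -> compareRotations(original, a, b));

        StringBuilder lastColumn = new StringBuilder(n);
        int start = -1;
        for (int row = 0; row < n; row++) {
            int rotation = rotations[row];
            if (rotation == 0) {
                start = row; // row holding the unrotated text
            }
            // The last character of a rotation is the one just before its start index.
            lastColumn.append(original.charAt((rotation + n - 1) % n));
        }
        return new Encoded(lastColumn.toString(), start);
    }
}
```

For "banana" this yields the last column "nnbaaa" with a start value of 3. Each comparison can still scan many characters on highly repetitive input, so a dedicated suffix-array construction may be preferable for very large files; that goes beyond what this answer describes.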

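For the inverse, we again want to avoid building the whole table for a text as large as the Bible. We can do this by using the fact that identical tokens in the first and last columns always appear in the same relative order. This holds because the first column is sorted and the rows are cyclic rotations: the rows that end in b are themselves in sorted order, and the text that follows each of those b's is the rest of its row, so the b's appear in the last column in the same order as in the first. So, to reverse:

  • Sort the output tokens. In addition to the sorted tokens, store the index at which each token started. So for the unsorted tokens "nnbaaa" you would store [3 4 5 2 0 1] and "aaabnn". Important: you must use a stable sort for this step.
  • Use the "start" value mentioned earlier to rebuild the string: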

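    // sorted: the stably sorted tokens; index[i]: where sorted token i sat in the unsorted output.
    String decode(String sorted, int[] index, int start){
        StringBuilder answer = new StringBuilder(sorted.length());
        int next = start;
        // Following index[] yields the characters in text order, so append them.
        // Looping exactly sorted.length() times also handles periodic text such as "abab".
        for(int i = 0; i < sorted.length(); i++){
            answer.append(sorted.charAt(next));
            next = index[next];
        }
        return answer.toString();
    }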
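
To make the two decoding steps concrete, here is one possible way to build the sorted tokens and the index array and then walk them. It is a sketch, not the answer's exact code: the class name `BwtDecodeSketch` is invented, and a counting sort is swapped in as the stable sort because it is stable by construction and linear in the input length.

```java
// Hypothetical sketch: inverts the transform from the last column and the start row.
final class BwtDecodeSketch {

    static String decode(String lastColumn, int start) {
        int n = lastColumn.length();
        if (n == 0) {
            return "";
        }

        // Count each character, then turn the counts into first positions:
        // afterwards counts[c] is the number of characters smaller than c.
        int[] counts = new int[Character.MAX_VALUE + 2];
        for (int i = 0; i < n; i++) {
            counts[lastColumn.charAt(i) + 1]++;
        }
        for (int c = 1; c < counts.length; c++) {
            counts[c] += counts[c - 1];
        }

        // Stable placement: equal characters keep their left-to-right order.
        char[] sorted = new char[n];
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            char c = lastColumn.charAt(i);
            int position = counts[c]++;
            sorted[position] = c;
            index[position] = i; // where this sorted token sat in the unsorted output
        }

        // Walk the index array from the start row, emitting one character per step.
        StringBuilder answer = new StringBuilder(n);
        int row = start;
        for (int i = 0; i < n; i++) {
            answer.append(sorted[row]);
            row = index[row];
        }
        return answer.toString();
    }
}
```

With the example from the list above, `decode("nnbaaa", 3)` builds "aaabnn" and [3 4 5 2 0 1] internally and returns "banana".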
    

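Answer 1 (score: 1)

This line:

    String temp = (string.substring(index,string.length()) + string.substring(0,index));

creates a copy of the entire input text every time it is called. Since you call it N times for an input text of N characters, your algorithm is O(N^2).

See whether you can optimize the originalSuffix method to avoid the copy.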
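
One way the copy might be avoided is to read the needed characters straight out of the original text with wrapped indices. The sketch below is an assumption about what such a rewrite could look like: `RotationKeySketch`, `rotationKey` and the `prefixLength` parameter are hypothetical names, and it produces the same prefix-plus-last-character key as the question's originalSuffix without building the rotated string.

```java
// Hypothetical sketch: builds the short key for one rotation without copying the text.
final class RotationKeySketch {

    static String rotationKey(String text, int index, int prefixLength) {
        int n = text.length();
        int prefix = Math.min(prefixLength, n); // guard against very short texts
        StringBuilder key = new StringBuilder(prefix + 1);
        // First characters of the rotation, wrapping around the end of the text.
        for (int i = 0; i < prefix; i++) {
            key.append(text.charAt((index + i) % n));
        }
        // The last character of the rotation is the one just before its start.
        key.append(text.charAt((index + n - 1) % n));
        return key.toString();
    }
}
```

Each call now touches only prefixLength + 1 characters, so building all N keys is linear in N rather than quadratic.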