多文本比较算法

时间:2014-01-22 14:54:45

标签: string algorithm text comparison

鉴于多个文本彼此略有不同(某些单词缺失/被其他单词替换),是否有一个很好的算法可以从中创建某种“模板”?例如:

Input:
 - Hello my name is Bob
 - Hi my name is Bob
 - Hi my name is John

Output:
 - * my name is *

算法应尽可能容忍异常值,例如,如果我添加第四个输入:

- Hello i am Bob

它不应该对结果造成太大影响。

2 个答案:

答案 0 :(得分:0)

生物学中存在类似的问题:http://en.wikipedia.org/wiki/Consensus_sequence

答案 1 :(得分:0)

这就是我实现它的方式。这不是花哨或使用最好的approuch,但是诀窍!

我使用java实现,你可以从中提取算法。

public static String matchText(List<String> textsSet)
{
    double maxWeight = 0;
    String subStringMaxWeight = "";

    // create a table to store all substring occurrences
    Hashtable<String, Integer> subStringsSet = new Hashtable<String, Integer>();

    // starts a loop through text's list
    for (String text : textsSet)
    {
        // splits text into words (separeted by spaces)
        String[] subStrings = text.split(" ");
        // if text has more than 1 word, then we are considering it as a "phrase"
        if (subStrings.length > 1)
        {
            // starts a loop to generate all possible word combination
            // "aaa bbb ccc ddd eee" text will generate these combinations:
            //  aaa bbb
            //  aaa bbb ccc
            //  aaa bbb ccc ddd
            //  aaa bbb ccc ddd eee
            //      bbb ccc
            //      bbb ccc ddd
            //      bbb ccc ddd eee
            //          ccc ddd
            //          ccc ddd eee
            //              ddd eee
            for (int posIni = 0; posIni < subStrings.length-1; posIni++)
            {
                for (int posEnd = posIni + 1; posEnd < subStrings.length; posEnd++)
                {
                    StringBuilder subString = new StringBuilder();

                    // if posIni is not the first word, adds * at the begining
                    if (posIni > 0)
                        subString.append("*");

                    // generate the substring from posIni to posEnd
                    for (int pos = posIni; pos <= posEnd; pos++)
                    {
                        if (subString.length() > 0)
                            subString.append(" ");
                        subString.append(subStrings[pos]);
                    }

                    // if posEnd is not the last word, adds * at the end
                    if (posEnd < subStrings.length - 1)
                        subString.append(" *");

                    // check if substring already exists in the Hashtable                       
                    Integer counter = subStringsSet.get(subString.toString());
                    if (counter == null)
                    {
                        // substring does not exist, so stores it with the counter = 1
                        counter = new Integer(1);
                        subStringsSet.put(subString.toString(), counter);
                    }
                    else
                    {
                        // substring exists, so increments it's counter
                        subStringsSet.remove(subString.toString());
                        counter = new Integer(counter.intValue() + 1);
                        subStringsSet.put(subString.toString(), counter);
                    }

                    // calculate the weight of the substring based on the number of occurrences plus weight based on the number of words.
                    // counter has heavier weight, but the number of words also has some weight (defined here as 0.05 - feel free to change it)
                    double weight = counter.intValue() + ((posEnd - posIni + 1) * 0.05);
                    if (weight > maxWeight)
                    {
                        maxWeight = weight;
                        subStringMaxWeight = subString.toString();
                    }

                }

            }
        }
    }
    // returns the substring
    return subStringMaxWeight;