鉴于多个文本彼此略有不同(某些单词缺失/被其他单词替换),是否有一个很好的算法可以从中创建某种“模板”?例如:
Input:
- Hello my name is Bob
- Hi my name is Bob
- Hi my name is John
Output:
- * my name is *
算法应尽可能容忍异常值,例如,如果我添加第四个输入:
- Hello i am Bob
它不应该对结果造成太大影响。
答案 0 :(得分:0)
生物学中存在类似的问题:http://en.wikipedia.org/wiki/Consensus_sequence
答案 1 :(得分:0)
这就是我实现它的方式。这不是花哨或使用最好的approuch,但是诀窍!
我使用java实现,你可以从中提取算法。
public static String matchText(List<String> textsSet)
{
double maxWeight = 0;
String subStringMaxWeight = "";
// create a table to store all substring occurrences
Hashtable<String, Integer> subStringsSet = new Hashtable<String, Integer>();
// starts a loop through text's list
for (String text : textsSet)
{
// splits text into words (separeted by spaces)
String[] subStrings = text.split(" ");
// if text has more than 1 word, then we are considering it as a "phrase"
if (subStrings.length > 1)
{
// starts a loop to generate all possible word combination
// "aaa bbb ccc ddd eee" text will generate these combinations:
// aaa bbb
// aaa bbb ccc
// aaa bbb ccc ddd
// aaa bbb ccc ddd eee
// bbb ccc
// bbb ccc ddd
// bbb ccc ddd eee
// ccc ddd
// ccc ddd eee
// ddd eee
for (int posIni = 0; posIni < subStrings.length-1; posIni++)
{
for (int posEnd = posIni + 1; posEnd < subStrings.length; posEnd++)
{
StringBuilder subString = new StringBuilder();
// if posIni is not the first word, adds * at the begining
if (posIni > 0)
subString.append("*");
// generate the substring from posIni to posEnd
for (int pos = posIni; pos <= posEnd; pos++)
{
if (subString.length() > 0)
subString.append(" ");
subString.append(subStrings[pos]);
}
// if posEnd is not the last word, adds * at the end
if (posEnd < subStrings.length - 1)
subString.append(" *");
// check if substring already exists in the Hashtable
Integer counter = subStringsSet.get(subString.toString());
if (counter == null)
{
// substring does not exist, so stores it with the counter = 1
counter = new Integer(1);
subStringsSet.put(subString.toString(), counter);
}
else
{
// substring exists, so increments it's counter
subStringsSet.remove(subString.toString());
counter = new Integer(counter.intValue() + 1);
subStringsSet.put(subString.toString(), counter);
}
// calculate the weight of the substring based on the number of occurrences plus weight based on the number of words.
// counter has heavier weight, but the number of words also has some weight (defined here as 0.05 - feel free to change it)
double weight = counter.intValue() + ((posEnd - posIni + 1) * 0.05);
if (weight > maxWeight)
{
maxWeight = weight;
subStringMaxWeight = subString.toString();
}
}
}
}
}
// returns the substring
return subStringMaxWeight;