Question

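For a school project, the goal is to fuzzy-match a query string against the lyrics string inside a Song object. The overall data structure is a TreeMap of unique words, each paired with the set of songs whose lyrics contain that word.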

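I already have a preliminary set of matching songs that contain the query words. The twist is that I must assign each resulting song a rank based on the number of characters (including spaces) in the matching section. For example, a search for "she loves you" would return these among the matches: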

"...she loves you..." The Beatles, rank = 13
"...she just loves you..." Bonnie Raitt, rank = 18
"...she loves me, well you..." Elvis Presley, rank = 23
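To make the rank concrete, here is a minimal sketch of the character count, assuming the lyrics are already split into words that were separated by single spaces. The names `RankSketch` and `rank` are illustrative only and are not part of the project code.

```java
import java.util.Arrays;

class RankSketch {
    // Rank = number of characters, spaces included, in the lyric span that
    // runs from word index start to word index end (both inclusive).
    // Assumption: words are rejoined with exactly one space between them.
    static int rank(String[] lyrics, int start, int end) {
        String span = String.join(" ", Arrays.copyOfRange(lyrics, start, end + 1));
        return span.length();
    }

    public static void main(String[] args) {
        String[] lyrics = "oh she just loves you so".split(" ");
        // The span "she just loves you" covers word indices 1 through 4.
        System.out.println(rank(lyrics, 1, 4));
    }
}
```

With this definition, an exact match of the query is simply the query's own length, and every extra word inside the span adds its length plus one space.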

The code I am using to rank the results is:

for (int i = 0; i < lyrics.length; i++) {
  if (lyrics[i].equals(query[0])) { // got the start point
    start = i; // adjust the start index point

    // loop through lyrics from start point
    for (int j = 1; j < query.length; j++) {
      if (lyrics[j].equals(query[query.length - 1])) {
        end = i; // found the last word
      }

      // if next lyric word doesn't match this query word
      if (!lyrics[i + j].equals(query[j])) {

        // advance loop through lyrics. when a match is found, i is adjusted to
        // the match index
        for (int k = i + j + 1; k < lyrics.length; k++) {
          if (lyrics[k].equals(query[j]) || lyrics[k].equals(query[0]))
            i = k++;
        } // end inner advance loop

      } // end query string test

    } // end query test loop

    song.setRanks(start, end); // start and end points for the rank algorithm.

  } // end start point test
}

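Since every song in the result set contains the query words in no particular order, not all of them should appear in the printed results. With this algorithm, how can I set up a trigger that removes a song from the set when the query does not match over any span of its lyrics?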
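One possible shape for such a trigger, as a minimal sketch rather than the project's actual code: have the ranking routine return -1 when the query words never occur in order, and drop those songs through the set's Iterator. The `Song` class, its fields, and the method names below are hypothetical stand-ins; the sketch also assumes a non-empty query and lyrics that are already normalized to lower case with single spaces.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

class PruneSketch {
    // Hypothetical stand-in for the project's Song class.
    static class Song {
        final String title;
        final String lyrics;
        int rank = -1;
        Song(String title, String lyrics) { this.title = title; this.lyrics = lyrics; }
    }

    // Returns the smallest character count (spaces included) of a lyric span
    // that starts at query[0] and contains the remaining query words in
    // order, or -1 if no such span exists.
    static int rank(String[] lyrics, String[] query) {
        int best = -1;
        for (int i = 0; i < lyrics.length; i++) {
            if (!lyrics[i].equals(query[0])) continue;
            int next = 1;                       // index of the next query word to find
            int chars = lyrics[i].length();
            for (int j = i + 1; next < query.length && j < lyrics.length; j++) {
                chars += 1 + lyrics[j].length(); // one space plus the word
                if (lyrics[j].equals(query[next])) next++;
            }
            if (next == query.length && (best == -1 || chars < best)) best = chars;
        }
        return best;
    }

    // Removes every song without an in-order match; stores the rank otherwise.
    static void prune(Set<Song> matches, String[] query) {
        for (Iterator<Song> it = matches.iterator(); it.hasNext(); ) {
            Song song = it.next();
            int r = rank(song.lyrics.split(" "), query);
            if (r < 0) {
                it.remove();                    // safe removal while iterating
            } else {
                song.rank = r;
            }
        }
    }

    public static void main(String[] args) {
        Set<Song> matches = new HashSet<>();
        matches.add(new Song("A", "oh she just loves you so"));
        matches.add(new Song("B", "you loves she"));
        prune(matches, new String[]{"she", "loves", "you"});
        for (Song s : matches) System.out.println(s.title + " rank=" + s.rank);
    }
}
```

Removing through the Iterator avoids the ConcurrentModificationException that calling `matches.remove(song)` inside a for-each loop over the same set would typically raise.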

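Edit - Is Lucene a solution? This is a gray area in the project, and I will bring it up in class tomorrow. The instructor allows us to choose any data structure for this project, but I don't know whether using another implementation for the string matching will pass muster.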

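Edit 2 @belisarius - I don't see how edit distance applies here. The most common application of Levenshtein distance takes a string a of length n and a string b of length m, and the distance is the number of edits needed to make a == b. For this project, all that is needed is the character rank of the match, with the start and end points unknown. With a few changes to the code posted above, I can find the start and end points accurately. What I need is a way to remove the non-matching songs from the set when the lyrics cannot fit the search in any way.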

Answer 1

You may want to look at Levenshtein distance. The Apache commons-lang library implements it in the StringUtils class as of version 2.1.
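For reference, a minimal self-contained sketch of the classic dynamic-programming Levenshtein distance follows. It is a textbook version written for illustration, not the commons-lang source, and the names `LevenshteinSketch` and `distance` are made up here.

```java
class LevenshteinSketch {
    // Number of single-character insertions, deletions, and substitutions
    // needed to turn a into b, computed with two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                        // build b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                        // delete all of a's prefix
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;                   // reuse the rows for the next pass
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("she loves you", "she just loves you"));
    }
}
```

Note that this measures the difference between two whole strings, so applying it to the lyrics problem would still require choosing which span of the lyrics to compare against the query.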

Answer 2

A Patricia trie might do this for you.

Have a look through this to see whether it has what you need.

http://code.google.com/p/patricia-trie/

Remove items from a Set that do not match a condition
