Approximate string search in Java (allowing a character to match multiple options)

Time: 2018-10-25 19:20:30

Tags: java string

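I currently have a BNDM search algorithm in Java, but I want to adapt it so that the letter "N" matches any other letter. For example, the string "NATG" should match "CATG". I'm writing software for nucleotide matching, so sequences will consist only of A, G, T, C and N, where N stands for any of A, G, T or C.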

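For example, given the sequence "ATGCN" and the source "ATGATGAATGCC", the program should return the index range in the source that matches the sequence, in this case 7-11 (zero-based, inclusive). If there are multiple matches, each one should be printed. Since the source is typically a thousand characters long, I'd like to implement a fast search algorithm. Below is my current BNDM code, but it only allows exact matches.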

I'm not sure whether the BNDM algorithm below can do this. I'm open to other search algorithms.

I've attached the code below:

import java.util.Scanner;

public class BNDM {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String source, pattern;
        System.out.print("Enter sequence:");
        pattern = sc.nextLine();
        System.out.print("Enter source:");
        source = sc.nextLine();

        if (pattern.equals(source)) {
            System.out.println("Sequence = Source");
        }

        char[] x = pattern.toCharArray(), y = source.toCharArray();
        int i, j, s, d, last, m = x.length, n = y.length;
        // One bit mask per char value; Java zero-initializes the array.
        // Each pattern position uses one bit of an int, so m must not exceed 32.
        int[] b = new int[65536];

        /* Pre processing */
        s = 1;
        for (i = m - 1; i >= 0; i--) {
            b[x[i]] |= s;
            s <<= 1;
        }

        /* Searching phase */
        j = 0;
        while (j <= n - m) {
            i = m - 1;
            last = m;
            d = ~0;
            // Scan the current window from right to left
            while (i >= 0 && d != 0) {
                d &= b[y[j + i]];
                i--;
                if (d != 0) {
                    if (i >= 0) {
                        last = i + 1;
                    } else {
                        // The whole window matched the pattern
                        System.out.println("Sequence in Source starting at position:");
                        System.out.println(j);
                        System.out.println("Sequence:");
                        System.out.println(pattern);
                        System.out.println("Source:");
                        System.out.println(source.substring(j, j + m));
                    }
                }
                d <<= 1;
            }
            j += last;
        }
    }
}

2 Answers:

Answer 0 (score: 0)

This kind of matching is easy to do with a regular expression:

// remember to add these at the top:
import java.util.regex.Matcher;
import java.util.regex.Pattern;



String pattern = "ATGCN";
String nucleotides = "ATGATGAATGCC";

// first convert the pattern into a proper regex
// i.e. replacing any N with [ATCG]
Pattern regex = Pattern.compile(pattern.replaceAll("N", "[ATCG]"));

// create a Matcher to find everywhere that the pattern matches
Matcher m = regex.matcher(nucleotides);

// find all the matches
while (m.find()) {
    System.out.println("Match found:");
    System.out.println("start:" + m.start());
    System.out.println("end:" + (m.end() - 1)); // minus 1 because end() returns the index just past the last matched character
    System.out.println();
}
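
One thing to be aware of: a plain `while (m.find())` loop resumes after the end of the previous match, so overlapping matches are skipped. If every match should be printed, one option is to restart the search one position after each match start using `Matcher.find(int)`. The following is only a sketch; the class name `OverlappingRegexSearch` and the method `findAll` are made up for illustration, and it assumes the pattern contains only A, C, G, T and N.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class OverlappingRegexSearch {

    // Returns inclusive {start, end} index pairs for every match,
    // including matches that overlap each other.
    static List<int[]> findAll(String pattern, String source) {
        // Assumption: the pattern holds only A, C, G, T, N, so no regex escaping is needed.
        Pattern regex = Pattern.compile(pattern.replace("N", "[ACGT]"));
        Matcher m = regex.matcher(source);
        List<int[]> hits = new ArrayList<>();
        int from = 0;
        // find(int) resets the matcher and searches from the given index, so
        // restarting one past the previous start also finds overlapping matches.
        while (from <= source.length() && m.find(from)) {
            hits.add(new int[] {m.start(), m.end() - 1});
            from = m.start() + 1;
        }
        return hits;
    }

    public static void main(String[] args) {
        // "ANA" occurs twice in "ATATA", and the two occurrences share an 'A'.
        for (int[] hit : findAll("ANA", "ATATA")) {
            System.out.println(hit[0] + "-" + hit[1]);
        }
    }
}
```

Run as is, this prints `0-2` and `2-4`, whereas the plain `find()` loop would report only `0-2` for the same input.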

Answer 1 (score: 0)

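import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Match {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String origin = in.next(); // the source sequence to search in
        String match = in.next();  // the pattern, where N stands for any nucleotide
        // Replace each N with an alternation over the four nucleotides
        Pattern pattern = Pattern.compile(match.replaceAll("N", "(A|G|T|C)"));
        Matcher matcher = pattern.matcher(origin);
        while (matcher.find()) {
            // end() is exclusive, so subtract 1 to print an inclusive range
            System.out.println(matcher.start() + "-" + (matcher.end() - 1));
        }
    }
}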
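
As an alternative to regex, the BNDM code from the question could probably be kept, since bit-parallel matching handles character classes naturally: a wildcard position just needs its bit set in the mask of every character. The sketch below is one way that might look, not a tested drop-in replacement. The class name `WildcardBndm` and the method `search` are made up here, and it assumes the pattern is 1 to 32 characters long (one bit per position in an `int`) and that `N` appears as a wildcard only in the pattern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: the question's BNDM loop with an 'N' wildcard added.
class WildcardBndm {

    // Returns the start index of every match of pattern in source, overlapping ones included.
    static List<Integer> search(String pattern, String source) {
        int m = pattern.length(), n = source.length();
        if (m == 0 || m > 32) {
            throw new IllegalArgumentException("pattern length must be between 1 and 32");
        }
        int[] b = new int[Character.MAX_VALUE + 1]; // one mask per char, zero-initialized
        int wildcard = 0;                           // bits of the positions holding 'N'
        int s = 1;
        for (int i = m - 1; i >= 0; i--) {
            char c = pattern.charAt(i);
            if (c == 'N') {
                wildcard |= s;
            } else {
                b[c] |= s;
            }
            s <<= 1;
        }

        List<Integer> hits = new ArrayList<>();
        int j = 0;
        while (j <= n - m) {
            int i = m - 1, last = m, d = ~0;
            while (i >= 0 && d != 0) {
                // OR-ing in the wildcard bits lets 'N' positions accept any source character.
                d &= b[source.charAt(j + i)] | wildcard;
                i--;
                if (d != 0) {
                    if (i >= 0) {
                        last = i + 1;
                    } else {
                        hits.add(j); // the whole window matched
                    }
                }
                d <<= 1;
            }
            j += last;
        }
        return hits;
    }

    public static void main(String[] args) {
        String pattern = "ATGCN";
        for (int start : search(pattern, "ATGATGAATGCC")) {
            System.out.println(start + "-" + (start + pattern.length() - 1));
        }
    }
}
```

For the example in the question this prints `7-11`. Longer patterns would need a `long` mask (up to 64 positions) or a different approach.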