Question

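I have an assignment about comparing 2 DNA sequences: find their substrings and compute a frequency vector for each sequence. The cosine of the angle between the 2 vectors then gives the similarity as a percentage. (In this case we are comparing a human with animals.)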
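For reference, this is the standard cosine-similarity formula as I understand it, written with $p$ and $q$ for the human and animal frequency vectors (the symbols are my own shorthand, not taken from the assignment text):

$$\cos\alpha = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert} = \frac{\sum_{i} p_i q_i}{\sqrt{\sum_{i} p_i^2}\,\sqrt{\sum_{i} q_i^2}}$$

Since every entry of a frequency vector is a non-negative count, the dot product cannot be negative, so $\cos\alpha$ lies between 0 and 1.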

The assignment is uploaded here: Download pdf

For example, I have an input .txt file that looks like this:

Human 2144721 HBHU 4HHB   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG   AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN   ALAHKYH

Gorilla 232230   MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG   AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN   ALAHKYH

Spider monkey 122567   VHLTGEEKAAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMSNPKVKAHGKKVLGA   FSDGLAHLDNLKGTFAQLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPQLQAAYQKVVAGVANA   LAHKYH

...

The output should be a double between 0 and 1.

My code so far:

import edu.princeton.cs.algs4.*;


public class betterDNACompare{

static int k = 20;
static int d = 10000;

static int[] hashHuman;
static int[] hashAnimal;

public static int[] findSubstring(String inputDNA)
{
    int[] auxArray = new int[d];
    for(int i=0; i<inputDNA.length()-k+1;i++)
    {
        String sub = inputDNA.substring(i,k+i);

        // Compute the bucket once; hashFunction returns a value in [0, d).
        int bucket = hashFunction(sub);
        if (bucket >= 0 && bucket < d) auxArray[bucket]++;
    }
    return auxArray;
}

public static void calculateProp()
{
    int[] p = hashHuman;
    int[] q = hashAnimal;

    double dotProduct =0;
    for(int i=0; i<p.length; i++ )
    {
        dotProduct+= p[i]*q[i];
    }
    double magnitudeP =0;
    double magnitudeQ =0;
    double x=0;
    double y=0;
    for(int i=0;i<p.length;i++)
    {

        x = x + (Math.pow(p[i], 2));
        y = y + (Math.pow(q[i], 2));

    }
    magnitudeP = Math.sqrt((double)x);
    magnitudeQ = Math.sqrt((double)y);

    double cosAlpha = (dotProduct/(magnitudeP*magnitudeQ));

    System.out.println("Animal and human: "+((cosAlpha)));

}

private static int hashFunction(String substring)
{
    return Math.abs(substring.hashCode()%d);
}

public static void main(String[] arg)
{
    String oneBigString = "";
    StringBuilder oneBig = new StringBuilder();
    System.out.println("Starting");

    int shitKode = 0;

    while(!StdIn.isEmpty()) {
        String line = StdIn.readLine();
        if(line.contains(">")){
            if(hashAnimal != null){
                System.out.println("Human + " + line);
                calculateProp();
            }
            if(line.contains("Gorilla")){
                hashHuman = findSubstring(oneBig.toString());
                oneBig = new StringBuilder();
            }
            else if(shitKode > 0){

                hashAnimal =findSubstring(oneBig.toString()) ;
                oneBig = new StringBuilder();
            } else{
                oneBig = new StringBuilder();
                shitKode++;
            }
        } else{
            oneBig.append(line);
        }
    }

}   
}

I'm using the Princeton algorithms package for StdIn, but it is 99% the same as Java's Scanner class.
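To make the Scanner comparison concrete, here is a minimal sketch of the same read loop using `java.util.Scanner` only. It assumes each record starts with a header line beginning with `>` (which is what the `line.contains(">")` check in my code expects) followed by one or more sequence lines. The class name `FastaReaderSketch` and the method `readRecords` are hypothetical names made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

class FastaReaderSketch {

    // Reads records of the assumed form ">name ..." followed by sequence lines.
    // Returns header text (without '>') mapped to the concatenated sequence,
    // in the order the records appear in the input.
    static Map<String, String> readRecords(Scanner in) {
        Map<String, String> records = new LinkedHashMap<>();
        String name = null;
        StringBuilder seq = new StringBuilder();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                continue; // skip blank lines between records
            }
            if (line.startsWith(">")) {
                if (name != null) {
                    records.put(name, seq.toString()); // store the previous record
                }
                name = line.substring(1).trim();
                seq.setLength(0);
            } else if (name != null) {
                seq.append(line);
            }
        }
        if (name != null) {
            records.put(name, seq.toString()); // store the last record too
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> records = readRecords(new Scanner(System.in));
        for (Map.Entry<String, String> e : records.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().length() + " characters");
        }
    }
}
```

Collecting every record first and storing the last one after the loop means no sequence depends on a later header line showing up, which is one place where a header-driven loop can silently drop the final entry.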

It gives me values that are either too low or too high.

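My understanding is:

I have to find all substrings of length k in the DNA sequence
I have to count how many times each substring occurs in the DNA string
I have to build the 2 frequency vectors and compute the angle between them. If the angle is small, the human and the animal are highly similar.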
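The three steps above can be sketched as follows. This is only an illustration under stated assumptions, not the assignment's reference solution: it counts substrings exactly in a `HashMap` instead of hashing them into a fixed array of size `d`, so distinct substrings never share a bucket, and it uses `k = 3` purely as an example value. The class name `KmerCosineSketch` and its methods are hypothetical; the two test strings are the first segments of the Human and Spider monkey entries from the sample input.

```java
import java.util.HashMap;
import java.util.Map;

class KmerCosineSketch {

    // Steps 1 and 2: count every substring of length k in the sequence.
    static Map<String, Integer> countKmers(String seq, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    // Step 3: cosine of the angle between two sparse count vectors.
    static double cosine(Map<String, Integer> p, Map<String, Integer> q) {
        double dot = 0, normP = 0, normQ = 0;
        for (Map.Entry<String, Integer> e : p.entrySet()) {
            int pv = e.getValue();
            normP += (double) pv * pv;
            // Substrings missing from q contribute 0 to the dot product.
            dot += (double) pv * q.getOrDefault(e.getKey(), 0);
        }
        for (int qv : q.values()) {
            normQ += (double) qv * qv;
        }
        if (normP == 0 || normQ == 0) {
            return 0.0; // a sequence shorter than k has no substrings to compare
        }
        return dot / (Math.sqrt(normP) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        int k = 3; // assumed example value, not the k = 20 used in my code
        String human = "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG";
        String monkey = "VHLTGEEKAAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMSNPKVKAHGKKVLGA";
        System.out.println("Human vs human: " + cosine(countKmers(human, k), countKmers(human, k)));
        System.out.println("Human vs spider monkey: " + cosine(countKmers(human, k), countKmers(monkey, k)));
    }
}
```

One thing this makes easier to see: a single differing character changes up to k of the overlapping substrings, so a large k such as 20 on sequences of roughly 150 characters pushes the score down sharply, while a very small k pushes almost every pair close to 1.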

How do I solve this?

Comparing 2 DNA sequences to find the similarity between them

0 answers: