I have a set of DNA sequences like the following:
>gi|58445847s cyclin-dependent kinase inhibitor 1B (p27, Kip1) (CDKN1B), mRNA
ATGTCAAACGTGCGAGTGTCTAACGGGAGCCCTAGCCTGGAGCGGATGGACGCCAGGCAGGCGGAGCACC
GCCCAAGAAGCCTGGCCTCAGAAGACGTCAAACGTAA
>gi|584458479:571-1167 Homo sapiens 1B (p27, Kip1) (CDKN1B), mRNA
ATGTCAAACGTGCGAGTGTCTAACGGGAGCCCTAGCCTGGAGCGGATGGACGCCAGGCAGGCGGAGCACC
ACAAAAGAGCCAACAGAACAGAAGAAAATGTTTCAGACGGTTCCCCAAATGCCGGTTCTGTGGAGCAGAC
GCCCAAGAAGCCTGGCCTCAGAAGACGTCAAACGTAA
I only want to extract the [ATGC]+ runs and ignore the lines that start with >.
This is the regular expression I came up with:
(?!\>.*\n)[ATGC\n]+
But the first match it finds is the C in (CDKN1B), and the next one runs from the A in ", mRNA" to the end of the last line before the next >.
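The behaviour described above can be reproduced with a short sketch. The class name LookaheadDemo, the helper findAll, and the shortened FASTA text are hypothetical and only serve as illustration; the regular expression is the one from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // Hypothetical helper: collects every match of `regex` in `text`.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real data (an assumption for this demo).
        String fasta = ">x (CDKN1B), mRNA\n"
                + "ATGTCA\n"
                + "GCCTAA\n"
                + ">y\n"
                + "ATGAAA\n";
        // The negative lookahead is only tested at the position where a match
        // would start, so it blocks a match starting at '>' and nothing else.
        for (String match : findAll("(?!\\>.*\\n)[ATGC\\n]+", fasta)) {
            // Show newlines as a literal \n so each match prints on one line.
            System.out.println("[" + match.replace("\n", "\\n") + "]");
        }
    }
}
```

The first two matches are the C of CDKN1B and then the A of mRNA together with the sequence lines that follow it, which is the behaviour reported in the question.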
Update
The following Java code can be used to find the DNA sequences in a file. It uses findWithinHorizon(pattern, 0) instead of useDelimiter(pattern).
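// Requires java.io.*, java.util.*, and java.util.regex.Pattern.
// fc is the file chooser from the surrounding code.
List<String> sequences = new ArrayList<>();
Scanner s = null; // declared outside try so the finally block can close it
try {
    s = new Scanner(new BufferedReader(new FileReader(fc.getSelectedFile())));
    // \\r?\\n accepts both Windows (\r\n) and Unix (\n) line endings.
    Pattern p = Pattern.compile("^[ACTG]+(?:\\r?\\n[ACTG]+)*", Pattern.MULTILINE);
    // Check for null before adding, so an input without sequences adds nothing.
    String str = s.findWithinHorizon(p, 0);
    while (str != null) {
        sequences.add(str);
        str = s.findWithinHorizon(p, 0);
    }
} catch (FileNotFoundException e) {
    System.out.println(e.getMessage());
} finally {
    if (s != null) {
        s.close();
    }
}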
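As a variation on the code above, here is a minimal sketch that does not depend on the line-ending style. It assumes Java 8 or later, where \R matches \n, \r\n, or \r. The class name SequenceScanner and the method readSequences are hypothetical, and joining the lines of each record into one string is an extra step not present in the original code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class SequenceScanner {
    // One or more consecutive lines made up only of A, C, T, G.
    private static final Pattern SEQUENCE =
            Pattern.compile("^[ACTG]+(?:\\R[ACTG]+)*", Pattern.MULTILINE);

    // Hypothetical helper: reads every sequence block from `source`.
    static List<String> readSequences(Readable source) {
        List<String> sequences = new ArrayList<>();
        // try-with-resources closes the Scanner even if reading fails.
        try (Scanner scanner = new Scanner(source)) {
            String match;
            // Horizon 0 means "search to the end of the input";
            // null signals that no further match exists.
            while ((match = scanner.findWithinHorizon(SEQUENCE, 0)) != null) {
                // Assumption: callers want each record as one unbroken string.
                sequences.add(match.replaceAll("\\R", ""));
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Mixed line endings on purpose, to exercise \R.
        String fasta = ">h1\r\nATG\r\nCCC\r\n>h2\nGGA\nTTT\n";
        System.out.println(readSequences(new StringReader(fasta)));
    }
}
```

To read from a file, a FileReader wrapped in a BufferedReader can be passed to readSequences in place of the StringReader.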
Answer 2 (score: 1)
Use a start-of-line anchor together with the multiline modifier:
(?m)^[ACTG\n]+
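The lookahead is useless: (?!\>.*\n) only rejects a match that starts exactly at a >, so the match can still begin at any later character of the header line.
Or, if you want to trim the trailing newline: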
(?m)^[ACTG]+(?:\n[ACTG]+)*
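To compare the two patterns side by side, here is a minimal sketch using java.util.regex directly. The class name DnaRegexDemo, the helper extract, and the shortened FASTA text are hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DnaRegexDemo {
    // Hypothetical helper: returns every match of `regex` in `text`.
    static List<String> extract(String regex, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the real data (an assumption for this demo).
        String fasta = ">seq1 example header, mRNA\n"
                + "ATGTCA\n"
                + "GCCTAA\n"
                + ">seq2 example header, mRNA\n"
                + "ATGAAA\n";

        // Newline inside the character class: each match keeps its trailing newline.
        for (String s : extract("(?m)^[ACTG\\n]+", fasta)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
        // Newline only between lines of bases: the trailing newline is left out.
        for (String s : extract("(?m)^[ACTG]+(?:\\n[ACTG]+)*", fasta)) {
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```

Because ^ with (?m) only matches at the start of a line, neither pattern can start inside a header, even though the header contains the letters A, C, T, and G.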
Example using java.util.Scanner:
String input = ">gi|58445847s cyclin-dependent kinase inhibitor 1B (p27, Kip1) (CDKN1B), mRNA\n"
+ "ATGTCAAACGTGCGAGTGTCTAACGGGAGCCCTAGCCTGGAGCGGATGGACGCCAGGCAGGCGGAGCACC\n"
+ "GCCCAAGAAGCCTGGCCTCAGAAGACGTCAAACGTAA\n"
+ "\n"
+ ">gi|584458479:571-1167 Homo sapiens 1B (p27, Kip1) (CDKN1B), mRNA\n"
+ "ATGTCAAACGTGCGAGTGTCTAACGGGAGCCCTAGCCTGGAGCGGATGGACGCCAGGCAGGCGGAGCACC\n"
+ "ACAAAAGAGCCAACAGAACAGAAGAAAATGTTTCAGACGGTTCCCCAAATGCCGGTTCTGTGGAGCAGAC\n"
+ "GCCCAAGAAGCCTGGCCTCAGAAGACGTCAAACGTAA\n";
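// MULTILINE makes ^ and $ match at the start and end of every line.
Pattern p = Pattern.compile("^[ACTG]+(?:\\n[ACTG]+)*$", Pattern.MULTILINE);
Scanner s = new Scanner(input);
String sequence;
// findWithinHorizon returns null once no further match exists, so loop on that
// rather than on hasNext(), which only reports whether another token remains.
while ((sequence = s.findWithinHorizon(p, 0)) != null) {
    System.out.println(sequence + "\n");
}
s.close();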