I have a space-delimited file and need to pull specific columns out of it. The file looks like this:
chr1.trna124 (75052562-75052633) Length: 72 bp
Type: His Anticodon: ATG at 33-35 (75052594-75052596) Score: 35.2
HMM Sc=29.40 Sec struct Sc=5.80
* | * | * | * | * | * | * |
Seq: TGGGGTATAGCTCCATGGTAGAGCGCATGCCTATGAAGCGTGAGGtCCTGGGTTTGATCCCCAGAACCACAA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
chr1.trna131 (78297795-78297866) Length: 72 bp
Type: Pro Anticodon: AGG at 33-35 (78297827-78297829) Score: 39.1
HMM Sc=24.30 Sec struct Sc=14.80
* | * | * | * | * | * | * |
Seq: GGCTTGTTGGTCTAGGGGTATGATTCTCACTTAGGGTGTGAGAGGtCCTGGGTTCAAATCTTGGACGAGTCC
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.
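From the above I want to extract the ID, i.e. the "chr1.trna124" column, and from the second line of each record the position that follows Anticodon: out of "ATG at 33-35" I need only the 33-35. This should be done for every record through the end of the file. What is the best way to do this? I am trying to merge each line that matches "chr" with the lines up to the next "chr" and then pick out the columns. I tried the approach from "How to grab the lines AFTER a matched line in python", but I could not even get that far. Is there a better way? Does the approach differ between Python 2.x and 3.x?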
Answer 0: (score: 1)
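You can use re.findall with a pattern that starts with "(?ms)" and is built from three parts:
(1) "^([\w.]+)\s\(\d+-\d+\)" matches the ID at the start of a line, followed by the coordinates in parentheses; the group captures the ID itself.
(2) ".+?" lazily matches everything between the ID and "Anticodon"; because of the 's' in "(?ms)", . matches newlines too.
(3) "(Anticodon:.+?)$" matches from "Anticodon" to the end of that line. To capture only the position (33-35), use "Anticodon:\s+\w+\s+at\s+(\d+-\d+)" instead.
Because of the 'm' in "(?ms)", '^' and '$' match at the start and end of every line, not only at the start and end of the whole string.
Concatenate the three parts in the order (1), (2), (3) to get the full expression.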
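For illustration, here is a minimal sketch of how the three parts might be assembled, with the capture groups placed around the ID and around the 33-35 position that the question asks for. The names SAMPLE, PATTERN and extract_id_and_position are hypothetical, and the sample text is a shortened copy of the records in the question; it assumes every record contains an "Anticodon:" field.

```python
import re

# Shortened copy of the two records from the question (assumed layout).
SAMPLE = (
    "chr1.trna124 (75052562-75052633) Length: 72 bp\n"
    "Type: His Anticodon: ATG at 33-35 (75052594-75052596) Score: 35.2\n"
    "HMM Sc=29.40 Sec struct Sc=5.80\n"
    "chr1.trna131 (78297795-78297866) Length: 72 bp\n"
    "Type: Pro Anticodon: AGG at 33-35 (78297827-78297829) Score: 39.1\n"
    "HMM Sc=24.30 Sec struct Sc=14.80\n"
)

# Hypothetical assembled pattern: the three parts concatenated in order.
PATTERN = (
    r"(?ms)"                                # m: ^/$ match per line; s: . matches newlines
    r"^([\w.]+)\s\(\d+-\d+\)"               # (1) ID at line start, then "(start-end)"
    r".+?"                                  # (2) lazily skip ahead to the Anticodon field
    r"Anticodon:\s+\w+\s+at\s+(\d+-\d+)"    # (3) capture only the position, e.g. 33-35
)


def extract_id_and_position(text):
    """Return a list of (ID, anticodon position) tuples, one per record."""
    return re.findall(PATTERN, text)


if __name__ == "__main__":
    for record_id, position in extract_id_and_position(SAMPLE):
        print(record_id + "\t" + position)
```

To run it on a real file, read the whole file into one string first (for example with open(path).read()) so the pattern can span lines. The re calls used here are available in both Python 2 and Python 3. Note that if a record had no "Anticodon:" field, the lazy ".+?" would run on into the next record, so the sketch relies on the assumption stated above.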