Extracting gene sequences from text using regular expressions

Time: 2017-04-05 23:20:31

Tags: python regex

My code does not give me the output I need:

import re
f = open("cub.txt")
cub = f.read()
f.close()
r = re.compile("([A-A]{1})\s([A-A,'A'])\s([-+]?[0-9]*\.?[0-9]*)")
matches = re.findall(r, cub)
print (matches)

Contents of the cub.txt file:

UUU F 0.45 16.8 ( 45768)  UCU S 0.18 14.1 ( 38296)  UAU Y 0.40 11.8 ( 32211)  UGU C 0.40  8.8 ( 23851)
UUC F 0.55 20.2 ( 54936)  UCC S 0.20 15.7 ( 42683)  UAC Y 0.60 17.8 ( 48342)  UGC C 0.60 13.3 ( 36075)
UUA L 0.08  7.0 ( 19129)  UCA S 0.15 11.6 ( 31442)  UAA * 0.41  0.8 (  2046)  UGA * 0.59  1.1 (  2986)
UUG L 0.13 12.6 ( 34146)  UCG S 0.07  5.2 ( 14079)  UAG Q 0.01  0.5 (  1281)  UGG W 1.00 12.0 ( 32616)

CUU L 0.13 12.4 ( 33708)  CCU P 0.27 15.3 ( 41672)  CAU H 0.40  9.5 ( 25885)  CGU R 0.10  5.4 ( 14682)
CUC L 0.18 16.8 ( 45753)  CCC P 0.30 17.0 ( 46097)  CAC H 0.60 14.4 ( 39081)  CGC R 0.19 10.4 ( 28305)
CUA L 0.06  6.0 ( 16211)  CCA P 0.28 15.7 ( 42767)  CAA Q 0.27 12.1 ( 33018)  CGA R 0.10  5.3 ( 14339)
CUG L 0.41 38.5 (104699)  CCG P 0.14  7.8 ( 21091)  CAG Q 0.72 32.6 ( 88743)  CGG R 0.18  9.7 ( 26453)

AUU I 0.35 16.8 ( 45653)  ACU T 0.25 13.3 ( 36078)  AAU N 0.43 16.9 ( 46039)  AGU S 0.14 11.2 ( 30390)
AUC I 0.46 22.0 ( 59906)  ACC T 0.31 16.5 ( 44951)  AAC N 0.57 22.5 ( 61099)  AGC S 0.26 20.2 ( 54867)
AUA I 0.18  8.8 ( 23805)  ACA T 0.30 16.1 ( 43884)  AAA K 0.44 27.3 ( 74256)  AGA R 0.22 12.2 ( 33289)
AUG M 1.00 23.2 ( 62972)  ACG T 0.14  7.7 ( 20943)  AAG K 0.56 34.3 ( 93393)  AGG R 0.21 11.7 ( 31945)

GUU V 0.21 13.1 ( 35593)  GCU A 0.29 20.8 ( 56528)  GAU D 0.50 25.3 ( 68683)  GGU G 0.18 11.4 ( 30898)
GUC V 0.22 13.6 ( 36917)  GCC A 0.32 22.9 ( 62202)  GAC D 0.50 24.9 ( 67783)  GGC G 0.31 19.7 ( 53631)
GUA V 0.12  7.8 ( 21277)  GCA A 0.26 19.0 ( 51713)  GAA E 0.43 31.0 ( 84178)  GGA G 0.27 17.6 ( 47765)
GUG V 0.45 28.2 ( 76624)  GCG A 0.13  9.1 ( 24768)  GAG E 0.57 40.9 (111123)  GGG G 0.25 16.0 ( 43513)

Desired output:

{'A': {'GCA': '0.26', 'GCC': '0.32', 'GCU': '0.29', 'GCG': '0.13'}, 'C': {'UGC': '0.60', 'UGU': '0.40'}, 'E': {'GAG': '0.57', 'GAA': '0.43'}, 'D': {'GAU': '0.50', 'GAC': '0.50'}, 'G': {'GGU': '0.18', 'GGG': '0.25', 'GGA': '0.27', 'GGC': '0.31'}, 'F': {'UUU': '0.45', 'UUC': '0.55'}, 'I': {'AUA': '0.18', 'AUC': '0.46', 'AUU': '0.35'}, 'H': {'CAC': '0.60', 'CAU': '0.40'}, 'K': {'AAG': '0.56', 'AAA': '0.44'}, '*': {'UAA': '0.41', 'UGA': '0.59'}, 'M': {'AUG': '1.00'}, 'L': {'CUU': '0.13', 'CUG': '0.41', 'CUC': '0.18', 'CUA': '0.06', 'UUG': '0.13', 'UUA': '0.08'}, 'N': {'AAU': '0.43', 'AAC': '0.57'}, 'Q': {'CAA': '0.27', 'CAG': '0.72', 'UAG': '0.01'}, 'P': {'CCU': '0.27', 'CCG': '0.14', 'CCA': '0.28', 'CCC': '0.30'}, 'S': {'UCU': '0.18', 'AGC': '0.26', 'UCG': '0.07', 'UCC': '0.20', 'UCA': '0.15', 'AGU': '0.14'}, 'R': {'CGA': '0.10', 'CGC': '0.19', 'AGA': '0.22', 'AGG': '0.21', 'CGG': '0.18', 'CGU': '0.10'}, 'T': {'ACC': '0.31', 'ACA': '0.30', 'ACG': '0.14', 'ACU': '0.25'}, 'W': {'UGG': '1.00'}, 'V': {'GUC': '0.22', 'GUA': '0.12', 'GUG': '0.45', 'GUU': '0.21'}, 'Y': {'UAC': '0.60', 'UAU': '0.40'}}

1 answer:

Answer 0: (score: 0)

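# Read the whole file and split on any whitespace (spaces and newlines).
# Splitting this way keeps a codon at the start of a line from being fused
# with the ")" that ends the previous line, and drops empty items.
with open('cub.txt', 'r') as myfile:
    data = myfile.read().split()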

result = {}

for index, item in enumerate(data):
    # Check if item is a codon. 
    if item.isalpha() and len(item) == 3:
        # Key is next item after codon.
        key = data[index+1]
        # Makes an internal hash if not done already, and inserts the result.
        if key not in result:
            result[key] = {}
        result[key][item] = data[index+2]

print(result)

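Keep in mind that this does not keep the keys in any particular order, because the result you asked for is a hash (a dict), and before Python 3.7 a plain dict does not guarantee any order.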

To keep the keys in a fixed (sorted) order, you can use OrderedDict:

from collections import OrderedDict

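# Read the whole file and split on any whitespace (spaces and newlines).
# Splitting this way keeps a codon at the start of a line from being fused
# with the ")" that ends the previous line, and drops empty items.
with open('cub.txt', 'r') as myfile:
    data = myfile.read().split()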

result = {}

for index, item in enumerate(data):
    # Check if item is a codon. 
    if item.isalpha() and len(item) == 3:
        # Key is next item after codon.
        key = data[index+1]
        # Makes an internal hash if not done already, and inserts the result.
        if key not in result:
            result[key] = {}
        result[key][item] = data[index+2]

# Order keys.
ordered_result = OrderedDict(sorted(result.items(), key=lambda t: t[0]))

# Order internal hashes.
for key, value in ordered_result.items():
    ordered_result[key] = OrderedDict(sorted(value.items(), key=lambda t: t[0]))

print(ordered_result)
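
Since the question asks for a regular expression, here is a minimal regex-based sketch of the same grouping. It is not the asker's original pattern. It assumes every entry has the form "codon, one-character amino acid code (or *), decimal fraction", as in the sample data, and that codons use only the letters A, C, G and U. The names CODON_RE and parse_codon_table are hypothetical, chosen only for this illustration.

```python
import re
from collections import defaultdict

# Assumed entry layout: three RNA letters, whitespace, a single amino-acid
# character (uppercase letter or '*'), whitespace, then a decimal fraction.
CODON_RE = re.compile(r"\b([ACGU]{3})\s+([A-Z*])\s+(\d+\.\d+)")


def parse_codon_table(text):
    """Return {amino_acid: {codon: fraction}} with fractions kept as strings."""
    result = defaultdict(dict)
    for codon, amino_acid, fraction in CODON_RE.findall(text):
        result[amino_acid][codon] = fraction
    return dict(result)


# Two lines copied from the sample data in the question.
sample = (
    "UUU F 0.45 16.8 ( 45768)  UCU S 0.18 14.1 ( 38296)  "
    "UAU Y 0.40 11.8 ( 32211)  UGU C 0.40  8.8 ( 23851)\n"
    "UUA L 0.08  7.0 ( 19129)  UCA S 0.15 11.6 ( 31442)  "
    "UAA * 0.41  0.8 (  2046)  UGA * 0.59  1.1 (  2986)\n"
)

table = parse_codon_table(sample)
print(table["*"])
```

To run it on the real file, pass the file contents instead of sample, for example parse_codon_table(open('cub.txt').read()). The trailing columns (frequency per thousand and the count in parentheses) are simply never matched, so no splitting or index arithmetic is needed.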