I have a list like this:
['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
'>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']
I want to create a new list with the same length and order, but keeping only the gene ID from each entry. The result would look like this:
['ENSG00000103091', 'ENSG00000196313']
I am using Python. Does anyone know how to do this? Thanks.
Answer 0: (score: 1)
Use a basic list comprehension:
lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']
res = [el[5:] for s in lst for el in s.split() if el.startswith('gene:')]
If you prefer a regular for loop, use:
lst = ['>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding', '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding']
res = []
for el in lst:                      # for each string in your list
    l = el.split()                  # create a second list, of whitespace-split strings
    for s in l:                     # for each string in the 'split strings' list
        if s.startswith('gene:'):   # if the string starts with 'gene:' we have a match
            res.append(s[5:])       # so skip the 'gene:' prefix and append the rest to the result
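One caveat, given that the question asks for a result with the same length and order: both versions above silently skip any string that has no `gene:` field, so the output can end up shorter than the input. A minimal sketch of one way to keep the two lists aligned, assuming a `None` placeholder is acceptable for missing IDs (the third header below is made up purely to show that case):

```python
lst = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>example_header_without_gene_field',  # hypothetical entry, not from the original data
]

# next() takes the first matching field of each string, or None when there is none,
# so res always has exactly one entry per input string.
res = [
    next((el[5:] for el in s.split() if el.startswith('gene:')), None)
    for s in lst
]
print(res)  # ['ENSG00000103091', 'ENSG00000196313', None]
```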
Answer 1: (score: 0)
For each string in the list:
Split the string on whitespace (Python's **split** string method)
Find the element starting with "gene:"
Keep the rest of the string (grab the slice [5:] of that element)
Do you have enough basic Python knowledge to take it from there? If not, I suggest consulting the string method documentation.
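For reference, the steps above could be sketched roughly as follows. The helper name `gene_id` is hypothetical, and the sketch assumes each header has at most one whitespace-separated field starting with `gene:`.

```python
def gene_id(header):
    """Return the text after 'gene:' in a header string, or None if no such field exists."""
    for field in header.split():           # split the string on whitespace
        if field.startswith('gene:'):      # 'gene_biotype:...' does not match this prefix
            return field[5:]               # keep everything after the 5-character prefix
    return None


headers = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]
gene_ids = [gene_id(h) for h in headers]
print(gene_ids)  # ['ENSG00000103091', 'ENSG00000196313']
```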
Answer 2: (score: 0)
This is by no means the most Pythonic way to achieve this, but it should do what you want.
l = [
'>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
'>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding'
]
genes = []
for e in l:
    e = e.split('gene:')    # e[1] is everything after 'gene:'
    gene = ''
    for c in e[1]:          # copy characters until the first space
        if c != ' ':
            gene += c
        else:
            break
    genes.append(gene)
print(genes)
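This loops over the elements of the list, splits each one at 'gene:', then collects the characters that follow, up to the first space, into a string and appends that string to the result list.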
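The same idea (take what follows 'gene:' up to the first space) can also be written without the character-by-character loop. This is only a sketch: the helper name `gene_after_marker` is hypothetical, and unlike the loop above it returns `None` instead of raising an error when 'gene:' is missing.

```python
def gene_after_marker(header, marker='gene:'):
    """Return the text between the first `marker` and the next space, or None if absent."""
    _, found, rest = header.partition(marker)  # found == '' when marker is not present
    if not found:
        return None
    return rest.split(' ', 1)[0]               # everything up to the first space


l = [
    '>ENST00000262144 cds:known chromosome:GRCh37:16:74907468:75019046:-1 gene:ENSG00000103091 gene_biotype:protein_coding transcript_biotype:protein_coding',
    '>ENST00000446813 cds:known chromosome:GRCh37:7:72349936:72419009:1 gene:ENSG00000196313 gene_biotype:protein_coding transcript_biotype:protein_coding',
]
genes = [gene_after_marker(e) for e in l]
print(genes)  # ['ENSG00000103091', 'ENSG00000196313']
```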