Extracting values from a string

Time: 2014-05-13 22:01:16

Tags: python

I want to extract certain values from a string in Python.

snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1

Output:

              GENE_ID         GENE_NAME   EXON_NUMBER  SEVERE_IMPACT
snp_1_881627  ENSG00000188976 NOC2L       16/19        SYNONYMOUS_CODON

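If the string contains a value for each variable (GENE_ID, GENE_NAME, EXON_NUMBER), output it; otherwise output "NA" (the variable is missing or has no value). In some cases these variables are not present in the string at all.

Which string method should I use for this? Should I split the string before extracting any values? I have 10k rows, and I need to extract the values for each snp_*.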

string=string.split(';')

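P.S. I am new to Python.

2 answers:

Answer 0: (score: 2)

There are two general strategies: split and regular expressions.

To use split, first separate the row label (snp_1_881627) from the rest: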

rowname, data = row.split()

Then you can split data into individual entries using the ; delimiter:

data = data.split(';')

Since you need the values for certain keys, we can turn the entries into a dictionary:

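dataDictionary = {}
for entry in data:
    # Split on the first '=' only, so a value that itself contains '=' stays intact
    entry = entry.split('=', 1)
    # Entries without '=' have no value; store None for them
    dataDictionary[entry[0]] = entry[1] if len(entry) > 1 else None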

Then you just check whether each key is in dataDictionary and, if it is, grab its value.
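The lookup step described above could be sketched as follows. This is only an illustration: the shortened sample row, the `EMPTY_KEY` entry and the `lookup` helper are assumptions made for the example, not part of the original data or answer.

```python
# Hypothetical row, shortened from the one in the question.
# EMPTY_KEY is an invented entry with no '=' to show the "no value" case.
row = "snp_1_881627    AA=G;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;EMPTY_KEY"

rowname, data = row.split()

# Build the key -> value dictionary; entries without '=' map to None.
data_dictionary = {}
for entry in data.split(';'):
    key, sep, value = entry.partition('=')
    data_dictionary[key] = value if sep else None


def lookup(d, key, default="NA"):
    """Return d[key], or default when the key is absent or has no value."""
    value = d.get(key)
    return value if value else default


wanted = ["GENE_ID", "SEVERE_IMPACT", "EMPTY_KEY"]
values = [lookup(data_dictionary, k) for k in wanted]
print(rowname, *values, sep="\t")
```

Using `dict.get` avoids a separate `in` check, and treating `None` or an empty string as missing covers both cases the question mentions (variable absent, or present without a value).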

Using split is nice because it indexes everything in the data string, which makes it easy to grab whatever you need.
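One detail of the sample row is that several values are comma-separated lists padded with "." placeholders (for example `GENE_ID=.,ENSG00000188976,...`), while the desired output shows a single value per column. A possible way to reduce such a list, assuming the first entry that is not "." is the one wanted, is sketched below. The helper name `first_real_value` and the `ONLY_DOTS` key are hypothetical.

```python
def first_real_value(raw, missing="NA"):
    """Pick the first comma-separated item that is not the '.' placeholder."""
    if not raw:
        return missing
    for item in raw.split(','):
        if item and item != '.':
            return item
    return missing


# Values copied from the sample row, plus one invented key (ONLY_DOTS).
fields = {
    "GENE_ID": ".,ENSG00000188976,.,ENSG00000188976,ENSG00000188976",
    "GENE_NAME": ".,NOC2L,.,NOC2L,NOC2L",
    "EXON_NUMBER": ".,16/19,.,.,.",
    "SEVERE_IMPACT": "SYNONYMOUS_CODON",
    "ONLY_DOTS": ".,.,.",  # hypothetical: every entry is a placeholder
}

columns = ["GENE_ID", "GENE_NAME", "EXON_NUMBER", "SEVERE_IMPACT", "ONLY_DOTS", "ABSENT"]
reduced = [first_real_value(fields.get(c)) for c in columns]
print("\t".join(reduced))
```

If a different rule is needed (say, collecting all distinct values), only the helper has to change.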

If the set of keys you need is not going to change, a regular expression might be the better option:

>>> import re
>>> re.search('(?<=GENE_ID=)[^;]*', 'onevalue;GENE_ID=SOMETHING;othervalue').group()
'SOMETHING'

Here I use a "lookbehind" to match one of the keywords, then call group() to get the value from the match. Put the keywords in a list and you can find all of the values:

import re
...
keywords = ['GENE_ID', 'GENE_NAME', 'EXON_NUMBER', 'SEVERE_IMPACT']
desiredValues = {}
for keyword in keywords:
    match = re.search('(?<={}=)[^;]*'.format(keyword), string_to_search)
    desiredValues[keyword] = match.group() if match else DEFAULT_VALUE
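One thing to watch with the lookbehind: `(?<=GENE_ID=)` also matches inside a longer key that happens to end in `GENE_ID`. A variant that requires the key to start the string or follow `;` or whitespace could look like the sketch below. The `extract` function, the `"NA"` default and the `SEVERE_GENE_ID` key in the sample are assumptions made for illustration; that key does not appear in the original data.

```python
import re

DEFAULT_VALUE = "NA"  # assumed placeholder for missing values


def extract(keywords, text, default=DEFAULT_VALUE):
    """Return {keyword: value} for each keyword, using default when not found."""
    found = {}
    for keyword in keywords:
        # The key must start the string or directly follow ';' or whitespace,
        # so 'GENE_ID' does not match inside 'SEVERE_GENE_ID'.
        pattern = r'(?:^|[;\s])' + re.escape(keyword) + r'=([^;]*)'
        match = re.search(pattern, text)
        found[keyword] = match.group(1) if match and match.group(1) else default
    return found


# Hypothetical row; SEVERE_GENE_ID is invented to show the overlap problem.
sample = "snp_1    SEVERE_GENE_ID=X1;GENE_ID=ENSG00000188976;GENE_NAME=NOC2L"
result = extract(["GENE_ID", "GENE_NAME", "EXON_NUMBER"], sample)
print(result)
```

On this sample the plain lookbehind pattern would return `X1` for `GENE_ID`, whereas the anchored pattern picks up the value of the exact key.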

Answer 1: (score: 0)

I think this is the solution you are looking for.

#input
user_in = 'snp_1_881627    AA=G;ALLELE=A;DAF_GLOBAL=0.473901;GENE_TRCOUNT_AFFECTED=1;GENE_TRCOUNT_TOTAL=1;SEVERE_GENE=ENSG00000188976;SEVERE_IMPACT=SYNONYMOUS_CODON;TR_AFFECTED=FULL;ANNOTATION_CLASS=REG_FEATURE,SYNONYMOUS_CODON,ACTIVE_CHROM,NC_TRANSCRIPT_VARIANT,NC_TRANSCRIPT_VARIANT;A_A_CHANGE=.,L,.,.,.;A_A_LENGTH=.,750,.,.,.;A_A_POS=.,615,.,.,.;CELL=GM12878,.,GM12878,.,.;CHROM_STATE=.,.,11,.,.;EXON_NUMBER=.,16/19,.,.,.;GENE_ID=.,ENSG00000188976,.,ENSG00000188976,ENSG00000188976;GENE_NAME=.,NOC2L,.,NOC2L,NOC2L;HGVS=.,c.1843N>T,.,n.3290N>T,n.699N>T;REG_ANNOTATION=H3K36me3,.,.,.,.;TR_BIOTYPE=.,PROTEIN_CODING,.,PROCESSED_TRANSCRIPT,PROCESSED_TRANSCRIPT;TR_ID=.,ENST00000327044,.,ENST00000477976,ENST00000483767;TR_LENGTH=.,2790,.,4201,1611;TR_POS=.,1893,.,3290,699;TR_STRAND=.,-1,.,-1,-1'

#set some empty vars
user_in = user_in.split(';')
final_output = ""
GENE_ID_FOUND = False
GENE_NAME_FOUND = False
EXON_NUMBER_FOUND = False
GENE_ID_OUTPUT = ''
GENE_NAME_OUTPUT = ''
EXON_NUMBER_OUTPUT = ''
SEVERE_IMPACT_OUTPUT = ''


for x in range(0, len(user_in)):
  if x == 0:
    first_line_count = 0
    first_line_print = ''
    while(user_in[0][first_line_count] != " "):
      first_line_print += user_in[0][first_line_count]
      first_line_count += 1
    final_output += first_line_print + "\t"
  else:

    if user_in[x][0:11] == "SEVERE_GENE":
      GENE_ID_OUTPUT += user_in[x][12:] + "\t"
      GENE_ID_FOUND = True

    if user_in[x][0:9] == "GENE_NAME":
      GENE_NAME_OUTPUT += user_in[x][10:] + "\t"
      GENE_NAME_FOUND = True

    if user_in[x][0:11] == "EXON_NUMBER":
      EXON_NUMBER_OUTPUT += user_in[x][12:] + "\t"
      EXON_NUMBER_FOUND = True

    if user_in[x][0:13] == "SEVERE_IMPACT":
      SEVERE_IMPACT_OUTPUT += user_in[x][14:] + "\t"

# Each found value already ends with a tab; add one after "NA" too,
# so the columns stay aligned when a field is missing
if GENE_ID_FOUND:
  final_output += GENE_ID_OUTPUT
else:
  final_output += "NA\t"

if GENE_NAME_FOUND:
  final_output += GENE_NAME_OUTPUT
else:
  final_output += "NA\t"

if EXON_NUMBER_FOUND:
  final_output += EXON_NUMBER_OUTPUT
else:
  final_output += "NA\t"

final_output += SEVERE_IMPACT_OUTPUT


print(final_output)
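The script above handles a single hard-coded row. For the 10k rows mentioned in the question, the same idea could be wrapped in a function and applied line by line, roughly as sketched here. This is an assumption-laden sketch: `summarize` is a hypothetical helper, `io.StringIO` stands in for a real input file, the second sample row is invented, and picking the first entry that is not "." from comma-separated values is a guess based on the desired output.

```python
import io

COLUMNS = ["GENE_ID", "GENE_NAME", "EXON_NUMBER", "SEVERE_IMPACT"]


def summarize(line, columns=COLUMNS, missing="NA"):
    """Turn one 'label<whitespace>KEY=VALUE;...' line into a tab-separated row."""
    parts = line.split(None, 1)
    if not parts:
        return None  # blank line
    label = parts[0]
    data = parts[1].strip() if len(parts) > 1 else ""
    # partition('=')[::2] keeps (key, value) and drops the separator.
    fields = dict(entry.partition('=')[::2] for entry in data.split(';') if entry)
    row = [label]
    for column in columns:
        items = fields.get(column, "").split(',')
        # First entry that is neither empty nor the '.' placeholder, else "NA".
        row.append(next((i for i in items if i and i != '.'), missing))
    return "\t".join(row)


# Stand-in for an open file; the second row is invented for the example.
lines = io.StringIO(
    "snp_1_881627    SEVERE_IMPACT=SYNONYMOUS_CODON;EXON_NUMBER=.,16/19,.;"
    "GENE_ID=.,ENSG00000188976;GENE_NAME=.,NOC2L\n"
    "\n"
    "snp_hypothetical    AA=G;ALLELE=A\n"
)

rows = [r for r in map(summarize, lines) if r is not None]
for r in rows:
    print(r)
```

With a real file, `lines` would be replaced by the object returned from `open(...)`, since iterating over a file also yields one line at a time.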