Question

I have two types of strings, shown below:

string1 = 'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
string2 = 'transcript_id "g3.t1"; gene_id "g3";'

I am trying to write a function that takes either of these strings as input and returns a dictionary built from it.

For string1, the dictionary should look like

attributes = {
    'ID': 'mRNA42',
    'Parent': 'gene19',
    'integrity': '0.95',
    'foo': 'bar',
}

and for string2

attributes = {
    'transcript_id': 'g3.t1',
    'gene_id': 'g3', 
}

My attempt:

def parse_single_feature_line(attributestring):

    attributes = dict()
    for keyvaluepair in attributestring.split(';'):
        for key, value in keyvaluepair.split('='):
            attributes[key] = value
    return attributes

I need help getting this function to work.

Answer 1

Try this:

string1 = 'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
string2 = 'transcript_id "g3.t1"; gene_id "g3";'

def str2dict(s):
    result = {}
    for i in s.split(";"):
        ele = i.strip()
        if not ele:
            continue  # skip the empty piece left by a trailing ";"
        if "=" in ele:
            key, val = ele.split("=", 1)
        else:
            key, val = ele.split(None, 1)
        result[key] = val.strip('"')
    return result

str2dict(string1)
str2dict(string2)

Answer 2

You can use a dict comprehension!

>>> string1
'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
>>> string2
'transcript_id "g3.t1"; gene_id "g3";'
>>> {each.split('=')[0]: each.split('=')[1] for each in string1.split(';') if each}
{'ID': 'mRNA42', 'Parent': 'gene19', 'integrity': '0.95', 'foo': 'bar'}
>>> {k: v.strip('"') for k, v in (each.split() for each in string2.split(';') if each.strip())}
{'transcript_id': 'g3.t1', 'gene_id': 'g3'}

To fix the problem in your own code:

def parse_single_feature_line(attributestring):
    attributes = dict()
    for keyvaluepair in attributestring.split(';'):
        # split() returns a flat list such as ['ID', 'mRNA42'], not a list of
        # lists, so unpack it directly. "for key, value in list_of_lists:" only
        # works on nested lists, e.g. [["this", "these"], ["that", "those"]].
        key, value = keyvaluepair.split('=')
        attributes[key] = value
    return attributes

print(parse_single_feature_line(string1))
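
The difference between the two forms can be checked in isolation. The following is only a minimal sketch; the variable names `pair`, `parts` and `loop_error` are made up for the illustration.

```python
pair = 'ID=mRNA42'
parts = pair.split('=')  # a flat list of two strings: ['ID', 'mRNA42']

# Direct unpacking works because the list has exactly two items.
key, value = parts

# Looping with two targets instead unpacks each *string* in the list.
# 'ID' happens to have two characters, so the first iteration succeeds,
# but 'mRNA42' has six characters and cannot be unpacked into two names.
try:
    for k, v in parts:
        pass
    loop_error = None
except ValueError as exc:
    loop_error = exc

print(key, value)
print(type(loop_error).__name__)
```

So the loop in the question does not iterate over key/value pairs at all; it iterates over the two halves of a single pair.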

Answer 3

You can use a single regular expression that handles both formats:

import re

string1 = 'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
string2 = 'transcript_id "g3.t1"; gene_id "g3";'

# Define the regular expression (raw string, so the backslashes reach the regex engine unchanged)
reg_exp = r"([\.\-\w_]+)=([\.\-\w_]+);?|([\.\-\w_]+) \"([\.\-\w_]+)\""

# Get results and drop the empty groups from each tuple
match = [tuple(filter(None, x)) for x in re.findall(reg_exp, string1+"\n"+string2)]

# Convert to dict
result = {key: value for key, value in match}

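This regular expression consists of two main groups:

Group A ([\.\-\w_]+)=([\.\-\w_]+);? and group B ([\.\-\w_]+) \"([\.\-\w_]+)\"

Each main group contains two capture groups, which match the name and the value of a pair. Note that you may need to adapt the character classes to the names and values you expect, or use (.*?) instead.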
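
To make the grouping concrete, here is a minimal sketch of what re.findall returns for a pattern with these two alternatives. The short sample string and the names `rows` and `pairs` are made up for the illustration; the pattern is the same one written as a single-quoted raw string.

```python
import re

pattern = re.compile(r'([\.\-\w_]+)=([\.\-\w_]+);?|([\.\-\w_]+) "([\.\-\w_]+)"')

# One item in each format, joined into a single sample string.
sample = 'ID=mRNA42; gene_id "g3";'

# Every match yields a 4-tuple, one slot per capture group.
# The two slots of the alternative that did not match are empty strings.
rows = pattern.findall(sample)

# Keep only the non-empty slots, which leaves (name, value) pairs.
pairs = [tuple(part for part in row if part) for row in rows]
result = dict(pairs)

print(rows)
print(result)
```

Filtering out the empty slots is what turns the four-column rows into pairs that dict() accepts.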

Answer 4

They are different formats, so they need to be handled differently.

def return_dict(string):
    # Split into items and drop the empty piece left by a trailing ";".
    items = [i.strip() for i in string.split(";") if i.strip()]
    if "=" in string:
        return dict(i.split("=", 1) for i in items)
    else:
        return {k: v.strip('"') for k, v in (i.split(" ", 1) for i in items)}

return_dict(string1)
return_dict(string2)

This gives:

{'ID': 'mRNA42', 'Parent': 'gene19', 'integrity': '0.95', 'foo': 'bar'}
{'transcript_id': 'g3.t1', 'gene_id': 'g3'}

Answer 5

First solution: split on the space and strip the quotes from the second half of the result:

>>> key, val = 'transcript_id "g3.t1"'.split(" ", maxsplit=1)
>>> val = val.strip('"')
>>> key
'transcript_id'
>>> val
'g3.t1'

Second solution (more general): use a regular expression to capture the parts:

>>> import re
>>> match = re.search(r'([a-z_]+) "(.+?)"', 'transcript_id "g3.t1"')
>>> key, val = match.groups()
>>> key
'transcript_id'
>>> val
'g3.t1'
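
The same pattern can probably be applied to the whole attribute string at once rather than to one item at a time. This is only a sketch, assuming every item has the form name "value"; it relies on re.findall returning one tuple per match when the pattern has two capture groups.

```python
import re

string2 = 'transcript_id "g3.t1"; gene_id "g3";'

# findall returns a list of (key, value) tuples, which dict() accepts directly.
attributes = dict(re.findall(r'([a-z_]+) "(.+?)"', string2))

print(attributes)
```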

If you know in advance which of the two formats a given string or file uses, you can pass a callback that parses each substring, i.e.:

def parse_line(attributestring, itemparse):
    attributes = dict()
    for keyvaluepair in attributestring.split(';'):
        keyvaluepair = keyvaluepair.strip()
        if not keyvaluepair:
            # empty string due to a trailing ";"
            continue
        # the callback returns a single (key, value) pair
        key, value = itemparse(keyvaluepair)
        attributes[key] = value
    return attributes


def parse_eq(kvstring):
    return kvstring.split("=", maxsplit=1)

def parse_space(kvstring):
    key, val = kvstring.split(" ", maxsplit=1)
    return key, val.strip('"')

d1 = parse_line(string1, parse_eq)
d2 = parse_line(string2, parse_space)
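
When the format is known from some external setting, one way to pick the callback is a small lookup table. The following is only a sketch: the registry `ITEM_PARSERS` and the format names "equals" and "quoted" are made up for the illustration, and the parsing functions are restated so the example runs on its own.

```python
def parse_eq(item):
    # 'ID=mRNA42' -> ('ID', 'mRNA42')
    key, value = item.split("=", 1)
    return key, value

def parse_space(item):
    # 'gene_id "g3"' -> ('gene_id', 'g3')
    key, value = item.split(" ", 1)
    return key, value.strip('"')

# Hypothetical registry: the format names are invented for this sketch.
ITEM_PARSERS = {"equals": parse_eq, "quoted": parse_space}

def parse_line(attributestring, itemparse):
    attributes = {}
    for item in attributestring.split(";"):
        item = item.strip()
        if item:  # skip the empty piece left by a trailing ";"
            key, value = itemparse(item)
            attributes[key] = value
    return attributes

string1 = 'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
string2 = 'transcript_id "g3.t1"; gene_id "g3";'

d1 = parse_line(string1, ITEM_PARSERS["equals"])
d2 = parse_line(string2, ITEM_PARSERS["quoted"])
print(d1)
print(d2)
```

Supporting a further format then only means adding one function and one registry entry; parse_line itself stays unchanged.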

Answer 6

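A simplified version: you can add more delimiters to the character class in the regular expression to split on other separators as well.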

import re

string1 = 'ID=mRNA42;Parent=gene19;integrity=0.95;foo=bar'
string2 = 'transcript_id "g3.t1"; gene_id "g3";'

def parse_single_feature_line(string):
    # Split each non-empty item on the first space or "=", then drop quotes around the value.
    pairs = (re.split('[ =]', i.strip(), maxsplit=1) for i in string.split(';') if i.strip())
    return {key: value.strip('"') for key, value in pairs}

print(parse_single_feature_line(string1))
print(parse_single_feature_line(string2))
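
As an illustration of adding a delimiter, the sketch below also accepts ":" between key and value. The function name `parse_attributes`, its `delimiters` parameter and the mixed sample string are assumptions made for this example, not part of the question.

```python
import re

def parse_attributes(string, delimiters=" =:"):
    # Build a character class such as '[ =:]' from the allowed delimiters.
    splitter = re.compile("[" + re.escape(delimiters) + "]")
    attributes = {}
    for item in string.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece left by a trailing ";"
        # Split on the first delimiter only, so the value may contain more of them.
        key, value = splitter.split(item, maxsplit=1)
        attributes[key] = value.strip('"')
    return attributes

# Made-up sample that mixes three separators.
mixed = 'ID=mRNA42; gene_id "g3"; score:0.95;'
print(parse_attributes(mixed))
```

An item that contains none of the delimiters would raise a ValueError at the unpacking step, so real input may need an extra check.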

How do I parse a custom string and create a dictionary from it?