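I'm looking for an elegant way to convert a list of substrings and the text between them into key/value pairs in a dictionary. For example: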
s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1','k2','k3']
(missing code)
# s_dict = {'k1':'some text', 'k2':'more text', 'k3':'and still more'}
This can be solved with str.find() and the like, but I'm sure there is a better solution than what I've hacked together.
Answer 0 (score: 13)
Option 1
If your keys contain no spaces or colons, you can simplify the solution with dict + re.findall (import re first):
>>> dict(re.findall(r'(\S+):(.*?)(?=\s\S+:|$)', s))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
Only the position of the colon (:) determines how keys and values are matched.
Details
(\S+)   # capture the key (one or more non-space characters)
:       # literal colon (matched, but not captured)
(.*?)   # non-greedy match of zero or more characters: this is the value
(?=     # lookahead that decides where the value stops
\s      # a whitespace character
\S+:    # one or more non-space characters followed by a colon (the next key)
|       # regex OR
$)      # end of the string
Note that this code assumes the structure shown in the question. It will fail on strings with an invalid structure.
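As a rough sketch, the same pattern can be compiled with re.VERBOSE so that the commentary lives inside the regex itself. The names PAIR_RE and parse_pairs are made up for this illustration and are not part of the answer. The second call shows one way the assumed structure can break: a value that contains a word followed by a colon.

```python
import re

# Same pattern as Option 1, written in verbose mode.
PAIR_RE = re.compile(r"""
    (\S+)          # key: one or more non-space characters
    :              # literal colon separating key and value
    (.*?)          # value: as few characters as possible
    (?=\s\S+:|$)   # stop before the next key or at the end of the string
""", re.VERBOSE)


def parse_pairs(s):
    """Build a dict from 'key:value key:value ...' text (hypothetical helper)."""
    return dict(PAIR_RE.findall(s))


good = parse_pairs('k1:some text k2:more text k3:and still more')
# "http:" looks like a key, so the value of k1 is cut short.
bad = parse_pairs('k1:see http://x k2:b')
print(good)
print(bad)
```

With the second input, "http" ends up as a key of its own, which is the kind of failure the note above warns about.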
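Option 2
Look ma, no regex...
This works on the same assumptions as the approach above. Split on the colon (:), split every piece except the first and the last once more on its last whitespace (to separate the previous value from the next key), then zip adjacent elements into a dictionary: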
v = s.split(':')
v[1:-1] = [j for i in v[1:-1] for j in i.rsplit(None, 1)]
dict(zip(v[::2], v[1::2]))
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
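To make the three steps easier to follow, here is the same idea with the intermediate lists spelled out. This is only an illustrative walkthrough; the variable names pieces and middle are not from the answer.

```python
s = 'k1:some text k2:more text k3:and still more'

# Step 1: split on the colon.
pieces = s.split(':')
# pieces == ['k1', 'some text k2', 'more text k3', 'and still more']

# Step 2: every inner piece holds "previous value + next key";
# rsplit(None, 1) cuts it once at the last run of whitespace.
middle = [part for chunk in pieces[1:-1] for part in chunk.rsplit(None, 1)]
# middle == ['some text', 'k2', 'more text', 'k3']
pieces[1:-1] = middle

# Step 3: even positions are keys, odd positions are values.
result = dict(zip(pieces[::2], pieces[1::2]))
print(result)
```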
Answer 1 (score: 7)
If the keys contain no spaces or colons, you can do this:
import re,itertools
s = 'k1:some text k2:more text k3:and still more'
toks = [x for x in re.split(r"(\w+):", s) if x]  # filter out empty tokens
# toks => ['k1', 'some text ', 'k2', 'more text ', 'k3', 'and still more']
d = {k:v for k,v in zip(itertools.islice(toks,None,None,2),itertools.islice(toks,1,None,2))}
print(d)
Result:
{'k2': 'more text ', 'k1': 'some text ', 'k3': 'and still more'}
Using itertools.islice avoids creating sublists such as toks[::2].
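The result above keeps the trailing space in front of each following key. A possible variation, sketched here under the assumption that surrounding whitespace in values is not wanted, strips each value and drops only the leading piece from re.split instead of filtering every empty token (filtering could misalign keys and values if a value were empty).

```python
import re
from itertools import islice

s = 'k1:some text k2:more text k3:and still more'

# re.split with a capturing group returns [text_before, key, value, key, value, ...];
# the first element is whatever precedes the first key (empty here).
toks = re.split(r"(\w+):", s)[1:]

keys = islice(toks, 0, None, 2)                          # items 0, 2, 4, ...
values = (v.strip() for v in islice(toks, 1, None, 2))   # items 1, 3, 5, ...
d = dict(zip(keys, values))
print(d)
```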
Answer 2 (score: 5)
Another bit of regex magic that splits the input string into key/value pairs:
import re
s = 'k1:some text k2:more text k3:and still more'
pat = re.compile(r'\s+(?=\w+:)')
result = dict(i.split(':') for i in pat.split(s))
print(result)
Output:
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
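The split-then-pair idea in the code above can be inspected in two stages. The sketch below is an illustration rather than part of the answer: split_pairs is a hypothetical helper, and passing a maxsplit of 1 to str.split is an added assumption so that a value which itself contains a colon does not raise a ValueError in dict().

```python
import re

pat = re.compile(r'\s+(?=\w+:)')


def split_pairs(s):
    """Split into 'key:value' chunks, then cut each chunk at its first colon."""
    # maxsplit=1 keeps any later colons inside the value.
    return dict(chunk.split(':', 1) for chunk in pat.split(s))


# Stage 1: the lookahead split yields one chunk per key.
chunks = pat.split('k1:some text k2:more text k3:and still more')
print(chunks)

# Stage 2: pairing, here with a value that contains a colon.
result = split_pairs('k1:a:b k2:c')
print(result)
```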
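Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression is used several times in a single program.
\s+(?=\w+:) - the key pattern: it splits the input string on whitespace (\s+), but only where that whitespace is followed by a "key" (a word \w+ followed by a colon :).
(?=...) - a positive lookahead assertion.
Answer 3 (score: 1)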
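If you have a list of known keys (and possibly known values too, though I don't address that in this answer), you can do it with a regex. There may be a shortcut, for example if you can simply assert that the last whitespace before a colon always marks the start of a key, but this should work too: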
import re
s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1', 'k2', 'k3']
dict_splitter = re.compile(r'(?P<key>({keys})):(?P<val>.*?)(?=({keys})|$)'.format(keys=')|('.join(key_list)))
result = {match.group('key'): match.group('val') for match in dict_splitter.finditer(s)}
print(result)
>> {'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}
Explanation:
(?P<key>({keys})) # match all the defined keys, call that group 'key'
: # match a colon
(?P<val>.*?) # match anything that follows and call it 'val', but
# only as much as necessary..
(?=({keys})|$) # .. as long as whatever follows is either a new key or
# the end of the string
.format(keys=')|('.join(key_list))
# build a string out of the keys where all the keys are
# 'or-chained' after one another, format it into the
# regex wherever {keys} appears.
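Caveat 1: If your keys can contain one another, order matters, and you may want to go from longer keys to shorter ones so that the longest match is tried first: key_list.sort(key=len, reverse=True)
Caveat 2: If your key list contains regex metacharacters, it will break the expression, so the keys may need to be escaped first: key_list = [re.escape(key) for key in key_list]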
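Both caveats can be folded into the pattern construction. The following is a minimal sketch, not the answer's own code: build_splitter and parse_known_keys are hypothetical names, the alternation is written with a plain | instead of one group per key, and values are stripped, which the original does not do.

```python
import re


def build_splitter(key_list):
    """Compile a key-aware pattern (hypothetical helper)."""
    # Caveat 1: longest keys first, so 'k10' is tried before 'k1'.
    ordered = sorted(key_list, key=len, reverse=True)
    # Caveat 2: escape metacharacters such as the '.' in 'a.b'.
    alternation = '|'.join(re.escape(k) for k in ordered)
    return re.compile(
        r'(?P<key>{keys}):(?P<val>.*?)(?=(?:{keys})|$)'.format(keys=alternation)
    )


def parse_known_keys(s, key_list):
    splitter = build_splitter(key_list)
    # strip() drops the trailing space left in front of the next key.
    return {m.group('key'): m.group('val').strip() for m in splitter.finditer(s)}


result = parse_known_keys('k1:one k10:ten a.b:dots', ['k1', 'k10', 'a.b'])
print(result)
```

Because 'a.b' is escaped, a string such as 'axb:no' is not mistaken for that key.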
Answer 4 (score: 1)
This version is a bit verbose but straightforward; it needs no libraries and takes key_list into account:
def substring_to_dict(text, keys, key_separator=':', block_separator=' '):
    s_dict = {}
    current_key = None
    for block in text.split(block_separator):
        if key_separator in block:
            key, word = block.split(key_separator, 1)
            if key in keys:
                # A known key starts a new entry; keep only the text after the separator.
                current_key = key
                block = word
        if current_key:
            s_dict.setdefault(current_key, []).append(block)
    return {key: block_separator.join(s_dict[key]) for key in s_dict}
Here are some examples:
>>> keys = {'k1','k2','k3'}
>>> substring_to_dict('k1:some text k2:more text k3:and still more', keys)
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more'}
>>> substring_to_dict('k1:some text k2:more text k3:and still more k4:not a key', keys)
{'k1': 'some text', 'k2': 'more text', 'k3': 'and still more k4:not a key'}
>>> substring_to_dict('', keys)
{}
>>> substring_to_dict('not_a_key:test', keys)
{}
>>> substring_to_dict('k1:k2:k3 k2:k3:k1', keys)
{'k1': 'k2:k3', 'k2': 'k3:k1'}
>>> substring_to_dict('k1>some;text;k2>more;text', keys, '>', ';')
{'k1': 'some;text', 'k2': 'more;text'}
Answer 5 (score: 0)
This isn't really a good idea, but for completeness, ast.literal_eval is also an option in this case:
from ast import literal_eval
s = 'k1:some text k2:more text k3:and still more'
key_list = ['k1','k2','k3']
s_ = s
for k in key_list:
    s_ = s_.replace('{}:'.format(k), '","{}": "'.format(k))
s_dict = literal_eval('{{{}"}}'.format(s_[2:]))
print(s_dict)
Output:
{'k1': 'some text ', 'k2': 'more text ', 'k3': 'and still more'}
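One likely reason this is fragile: the text is pasted into a dict literal, so any double quote inside a value changes the literal's syntax. The sketch below wraps the steps above in a hypothetical function, to_dict_via_literal_eval, purely to demonstrate that failure mode.

```python
from ast import literal_eval


def to_dict_via_literal_eval(s, key_list):
    """Hypothetical wrapper around the replace-and-evaluate steps above."""
    s_ = s
    for k in key_list:
        s_ = s_.replace('{}:'.format(k), '","{}": "'.format(k))
    return literal_eval('{{{}"}}'.format(s_[2:]))


key_list = ['k1', 'k2']
ok = to_dict_via_literal_eval('k1:plain k2:text', key_list)
print(ok)

# A double quote inside a value produces an invalid dict literal.
try:
    to_dict_via_literal_eval('k1:say "hi" k2:bye', key_list)
    failed = False
except (SyntaxError, ValueError):
    failed = True
print(failed)
```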