Pythonic way to extract a required substring from a larger string

Time: 2017-07-07 08:11:38

Tags: python regex python-2.7 python-3.x

I have a string like this:

msg = b'@\x06string\x083http://schemas.microsoft.com/2003/10/Serialization/\x9a\x05\x18{"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}\x01'

The string {"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"} is JSON-parsable, so I came up with the following code to strip the junk from the msg above:

>>> x1 = msg.split(b'{"', 1)[1]
>>> x1
b'PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}\x01'
>>> x2 = x1[::-1].split(b'}"', 1)[1][::-1]
>>> x2
b'PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,'
>>> final_msg = b'{"%s"}'%x2
>>> final_msg
b'{"PUID":"9279565","Title":"Risk Manager","Description":"<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,"}'
>>> import json
>>> json.loads(final_msg)
{'Description': "<strong>Risk Manager </strong><br />\\n<br />\\nLentech, Inc. is currently seekinga Risk Manager inGreenbelt,'", 'Title': 'Risk Manager', "b'PUID": '9279565'}

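This is a poor way of doing what I need, and I would like to know a cleaner, more efficient way to get the same result. I think a regular expression would be useful here, but my knowledge of regex is very limited.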

Thanks in advance.

2 Answers:

Answer 0: (score: 1)

Here you go:

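import re
# msg is bytes, so the pattern must be bytes too (a str pattern raises TypeError on Python 3).
# The braces are escaped for clarity; re.DOTALL lets '.' match newlines inside the JSON.
final_msg = re.search(br"\{.*\}", msg, re.DOTALL).group(0)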
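
To go from the regex match to a parsed object, the steps can be sketched end to end as follows. This is only a minimal sketch: the helper name `extract_json` is hypothetical, the sample `msg` is a shortened version of the one in the question, and it assumes the JSON payload itself is valid UTF-8 and that the junk after it contains no `}`.

```python
import json
import re

# Shortened version of the msg from the question (assumption: same framing bytes).
msg = (b'@\x06string\x083http://schemas.microsoft.com/2003/10/Serialization/'
       b'\x9a\x05\x18{"PUID":"9279565","Title":"Risk Manager"}\x01')


def extract_json(raw):
    """Return the JSON object embedded in raw (bytes) as a dict."""
    # Greedy match from the first '{' to the last '}' in the message.
    match = re.search(br"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in message")
    # Decode only the matched slice, so the junk bytes around it never matter.
    return json.loads(match.group(0).decode("utf-8"))


record = extract_json(msg)
print(record["Title"])
```

Because the match is greedy, a stray `}` in the trailing bytes would be included and `json.loads` would then fail, so the error from `json.loads` is worth catching in real use.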

Answer 1: (score: 0)

You can first convert the bytes object to a string:

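# On Python 3, str(msg) would return the "b'...'" repr rather than the text, so decode instead.
# errors='replace' swaps the non-UTF-8 junk bytes for U+FFFD instead of raising.
msg = msg.decode('utf-8', errors='replace')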

After that, you can write a generator function that uses enumerate to yield the indices of the characters you are looking for:

def gen_index(a_string):
    # First yield the index of every '{' ...
    for i, symbol in enumerate(a_string):
        if symbol == '{':
            yield i
    # ... then the index of every '}'
    for j, symbol in enumerate(a_string):
        if symbol == '}':
            yield j

import json

a = list(gen_index(msg))  # indices of every '{', followed by indices of every '}'
# Slice from the first occurrence of '{' to the last occurrence of '}' and parse it as JSON
json_output = json.loads(msg[a[0]:a[-1] + 1])
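
The same "first `{` to last `}`" idea can also be written with the built-in `find` and `rfind` methods, which work directly on bytes and avoid scanning the string character by character. This is a minimal sketch rather than part of the original answer: the helper name `slice_json` is hypothetical, the sample `msg` is a shortened version of the one in the question, and the JSON payload is assumed to be valid UTF-8.

```python
import json

# Shortened version of the msg from the question (assumption: same framing bytes).
msg = (b'@\x06string\x083http://schemas.microsoft.com/2003/10/Serialization/'
       b'\x9a\x05\x18{"PUID":"9279565","Title":"Risk Manager"}\x01')


def slice_json(raw):
    """Parse the span from the first '{' to the last '}' in raw (bytes)."""
    start = raw.find(b"{")   # index of the first '{', or -1 if absent
    end = raw.rfind(b"}")    # index of the last '}', or -1 if absent
    if start == -1 or end < start:
        raise ValueError("no JSON object found in message")
    # Decode only the JSON slice; the surrounding junk bytes are never decoded.
    return json.loads(raw[start:end + 1].decode("utf-8"))


json_output = slice_json(msg)
print(json_output["PUID"])
```

Slicing the bytes before decoding means the invalid bytes in the header never have to be converted to text at all.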