Python regex issues - Attempting to parse string

Time: 2016-04-21 22:31:33

Tags: python json regex string parsing

I want to take a string like this:

enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''

and convert it into a JSON type format:

{'enabled': 'false', 'script': 'var name=\'Bob\'\\n ', 'index': '0', 'value': ''}

but I cannot for the life of me figure out a regex or a combination of splitting the string that will produce the result.

The values can contain any special characters, and single quotes and backslashes inside them are always escaped.

Is there any way to get the regex in Python to stop after finding the first match?

For example, this:

import re
re.findall('[a-zA-Z0-9]+=\'.*\'', line)

will match the entire string instead and won't stop at

['stripPrefix=\'false\'', ....]

like I would like it to.

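4 answers:

Answer 0 (score: 1)

First, I assume there is a mistake in your input string: the quote before "Bob" should be escaped.

If my assumption is correct, I would use a regex like this: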

>>> import re
>>> line = r"""enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"""
>>> re.findall(r"([a-zA-Z]*)='((?:[^'\\]|\\.)*)'\s*", line)
[('enabled', 'false'), ('script', "var name=\\'Bob\\'\\\\n "), ('index', '0'), ('value', '')]
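  • [^'\\] matches any character except a quote or a backslash
  • \\. matches a backslash followed by any single character
  • ([^'\\]|\\.) matches either of the two previous cases
  • (?:[^'\\]|\\.) does the same, but does not capture the match into the result (see https://docs.python.org/2.7/library/re.html)
  • (?:[^'\\]|\\.)* repeats that any number of times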
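
To make the breakdown above concrete, here is a minimal sketch that contrasts the greedy pattern from the question with a non-greedy variant and with the escape-aware pattern. The variable names (greedy, lazy, escape_aware) are mine, not the asker's, and the sample line is the one from the question.

```python
import re

line = r"""enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"""

# Greedy: .* runs to the last quote on the line, so there is only one match.
greedy = re.findall(r"[a-zA-Z0-9]+='.*'", line)

# Non-greedy: .*? stops at the first quote it sees, even an escaped one,
# so the script value is cut off right after the backslash.
lazy = re.findall(r"([a-zA-Z0-9]+)='(.*?)'", line)

# Escape-aware: a value is a run of characters that are neither a quote nor
# a backslash, or backslash-escaped characters, so \' never ends the value.
escape_aware = re.findall(r"([a-zA-Z0-9]+)='((?:[^'\\]|\\.)*)'", line)

print(len(greedy))      # number of matches for the greedy pattern
print(lazy[1])          # truncated script value
print(escape_aware[1])  # complete script value
```

A non-greedy quantifier alone is therefore not enough for this input: the pattern has to know about the escaping rule to find the real closing quote.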

Answer 1 (score: 0)

>>> import re
>>> line = "enabled='false' script='var name=\\'Bob\\'\\n \\\\' index='0' value=''"
>>> print line
enabled='false' script='var name=\'Bob\'\n \\' index='0' value=''
>>> groups = re.findall(r"([a-zA-Z0-9]+)='((?:\\.|[^\'])*)'", line)
>>> for name, value in groups:
...     print name
...     print value
... 
enabled
false
script
var name=\'Bob\'\n \\
index
0
value

>>> import json
>>> print json.dumps(dict(groups))
{"index": "0", "enabled": "false", "value": "", "script": "var name=\\'Bob\\'\\n \\\\"}

The regex is based on this answer.

Note that Python strings can use either single or double quotes. If your string literal contains one of those, use the other. If it contains both, use triple quotes: """. This way you don't have to awkwardly escape the quotes. The r prefix denotes a raw string and also lets you cut down on escaping: in this case it allows me to write \\ instead of \\\\!
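
As a small illustration of the quoting advice above, the sketch below spells the same values as regular, raw, and raw triple-quoted literals. The variable names are only for illustration.

```python
# Three spellings of the same two-character value: a backslash and a quote.
plain = "\\'"        # regular literal: the backslash must be doubled
raw = r"\'"          # raw literal: the backslash is kept as written
triple = r"""\'"""   # raw triple-quoted literal: same value again

# The same applies to regex patterns: r"\\." and "\\\\." are one pattern.
pattern_raw = r"\\."
pattern_plain = "\\\\."

print(plain == raw == triple)
print(len(raw))
print(pattern_raw == pattern_plain)
```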

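Answer 2 (score: 0)

First, I assume your sample input is missing a backslash to escape the single quote before Bob.

Second, the expected output you give is not strict JSON, because JSON uses double quotes. My solution gives you a standard JSON string.

I chose to parse the string properly into memory and then serialize it to JSON, rather than trying to convert it to JSON directly. The regex and the unescape step match the key-value pairs in the input and replace the escape sequences in each value, so you end up with the exact string each value represents. At that point you could simply build a Python dict from the values and dump it to JSON. Unfortunately, Python dicts do not guarantee insertion order (before Python 3.7), so the output has a random order of entries. To keep the order, treat the parsed values as a stream of key-value pairs and use a custom JSON serializer, as follows: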
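
A minimal sketch of the difference, using a made-up two-entry dict rather than the full input: the Python repr of a dict uses single quotes and is rejected by a JSON parser, while json.dumps produces double-quoted output.

```python
import json

pairs = {"enabled": "false", "value": ""}

# repr() of a dict uses single quotes, which JSON parsers reject
python_repr = repr(pairs)
try:
    json.loads(python_repr)
    repr_is_json = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    repr_is_json = False

# json.dumps() emits double-quoted strings, which is valid JSON
as_json = json.dumps(pairs)

print(python_repr)
print(repr_is_json)
print(as_json)
```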

import re
import json

ESCAPES = {
  "n": "\n",
  "t": "\t",
  # ...
}
def _escapematch(m):
  x = m.group(1)
  return ESCAPES.get(x, x)

def unescape(literal):
  return re.sub(r"\\(.)", _escapematch, literal)

def parse_pairs(line):
  # finditer yields match objects, so read the two groups explicitly
  return (
    (m.group(1), unescape(m.group(2)))
    for m in
    re.finditer(r"([a-zA-Z0-9]+)='((?:[^\\']|\\.)*)'", line)
  )

def convert_to_json(line):
  return json.dumps(dict(parse_pairs(line)))

def dumps_json_object(o):
  return "{" + ", ".join(
    json.dumps(k) + ": " + json.dumps(v)
    for k,v in o
  ) + "}" 

def convert_to_json_keep_order(line):
  return dumps_json_object(parse_pairs(line))

line = """
enabled='false' script='var name=\\'Bob\\'\\\\n ' index='0' value=''
"""

print(convert_to_json(line))
# {"value": "", "enabled": "false", "index": "0", "script": "var name='Bob'\\n "}
# Note the random order at every execution

print(convert_to_json_keep_order(line))
# {"enabled": "false", "script": "var name='Bob'\\n ", "index": "0", "value": ""}
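
As an alternative sketch for keeping the order without a hand-written serializer, the parsed pairs can be put into a collections.OrderedDict from the standard library, which json.dumps serializes in insertion order. The names PAIR and to_ordered_json are hypothetical, and the unescape step here is simplified: it only drops the backslash and does not map \n or \t to control characters as the ESCAPES table above does.

```python
import json
import re
from collections import OrderedDict

# Same key='value' pattern as above, compiled once
PAIR = re.compile(r"([a-zA-Z0-9]+)='((?:[^\\']|\\.)*)'")

def to_ordered_json(line):
    # findall returns (key, value) tuples in the order they appear
    pairs = PAIR.findall(line)
    # simplified unescape: drop the backslash before any escaped character
    cleaned = [(k, re.sub(r"\\(.)", r"\1", v)) for k, v in pairs]
    # OrderedDict remembers insertion order, and json.dumps follows it
    return json.dumps(OrderedDict(cleaned))

line = r"enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''"
result = to_ordered_json(line)
print(result)
```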

Answer 3 (score: 0)

Pyparsing is very useful here, especially if your input gets more complex. See the comments in the source code below:

from pyparsing import *

EQ = Suppress('=')
key = Word(alphas, alphanums)
value = QuotedString("'", escChar="\\")
parser = OneOrMore(Group(key + EQ + value))

# multiplication with an integer or tuple works too
#  parser = 4 * Group(key + EQ + value)
#  ONE_OR_MORE = (1,)
#  parser = ONE_OR_MORE * Group(key + EQ + value)


sample = r"""
    enabled='false' script='var name=\'Bob\'\\n ' index='0' value=''
"""

# parse the sample string
res = parser.parseString(sample)

# pretty-print parsed results
res.pprint()

# convert results to list and make a dict from it
print(dict(res.asList()))


# alternatively, make the parser do the dict-building
parser = Dict(OneOrMore(Group(key + EQ + value)))
res = parser.parseString(sample)

# parsed results look like a list
res.pprint()

# but Dict will define key-values to make a dict-like return object
print(res.dump())
print(res['enabled'])
print(res.keys())

# or access fields using object.attribute notation
print(res.enabled)

Prints:

[['enabled', 'false'],
 ['script', "var name='Bob'\\\n "],
 ['index', '0'],
 ['value', '']]

{'index': '0', 'enabled': 'false', 'value': '', 'script': "var name='Bob'\\\n "}

[['enabled', 'false'],
 ['script', "var name='Bob'\\\n "],
 ['index', '0'],
 ['value', '']]

[['enabled', 'false'], ['script', "var name='Bob'\\\n "], ['index', '0'], ['value', '']]
- enabled: false
- index: 0
- script: var name='Bob'\

- value: 

false

['index', 'enabled', 'value', 'script']

false