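Hi all, I'm trying to parse a fairly well-formed string into its component pieces. The string is very JSON-like, but strictly speaking it is not JSON. The strings are formed like so: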
createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source="Region", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
The output would just be chunks of text; nothing special needs to be done with them at this point.
createdAt=Fri Aug 24 09:48:51 EDT 2012
id=238996293417062401
text='Test Test'
source="Region"
entities=[foo, bar]
user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}
Using the following expression, I'm able to separate out most of the fields:
,(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))
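This splits on every comma that is not inside any type of quotes, but I can't seem to make the leap to splitting only on commas that are also not inside brackets or braces.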
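To make the gap concrete, here is a small sketch that applies the expression from the question to a shortened sample. The names QUOTE_AWARE_COMMA and sample are made up for this illustration and the sample string is not from the question.

```python
import re

# The pattern from the question: a comma that is followed by an even number
# of double quotes and an even number of single quotes.
QUOTE_AWARE_COMMA = re.compile(
    r''',(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))(?=(?:[^']*'[^']*')*(?![^']*'))'''
)

# Hypothetical sample: one quoted value with a comma, one bracketed list.
sample = 'source="Reg,ion", entities=[foo, bar]'
parts = QUOTE_AWARE_COMMA.split(sample)
print(parts)
# -> ['source="Reg,ion"', ' entities=[foo', ' bar]']
```

The comma inside the quotes is left alone, but the comma inside [foo, bar] is still treated as a separator, which is exactly the problem described.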
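Answer 0 (score: 2)
Because you want to handle nested parens/brackets, the "right" way to handle them is to tokenize them separately and keep track of your nesting level. So instead of a single regex, you really need multiple regexes, one for each token type.
This is Python, but converting it to Java shouldn't be too hard.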
import re

# just comma
sep_re = re.compile(r',')
# open paren, bracket, or brace (braces are needed for the question's {...})
inc_re = re.compile(r'[\[({]')
# close paren, bracket, or brace
dec_re = re.compile(r'[)\]}]')
# string literal
# (I was lazy with the escaping. Add other escape sequences, or find an
# "official" regex to use.)
chunk_re = re.compile(r'''"(?:[^"\\]|\\")*"|'(?:[^'\\]|\\')*[']''')

# This class could've been just a generator function, but I couldn't
# find a way to manage the state in the match function that wasn't
# awkward.
class tokenizer:
    def __init__(self):
        self.pos = 0

    def _match(self, regex, s):
        # Try to match regex at the current position; advance past it on success.
        m = regex.match(s, self.pos)
        if m:
            self.pos += len(m.group(0))
            self.token = m.group(0)
        else:
            self.token = ''
        return self.token

    def tokenize(self, s):
        field = ''  # the field we're working on
        depth = 0   # how many parens/brackets/braces deep we are
        while self.pos < len(s):
            if not depth and self._match(sep_re, s):
                # In Java, change the "yields" to append to a List, and you'll
                # have something roughly equivalent (but non-lazy).
                yield field
                field = ''
            else:
                if self._match(inc_re, s):
                    depth += 1
                elif self._match(dec_re, s):
                    depth -= 1
                elif self._match(chunk_re, s):
                    pass
                else:
                    # everything else we just consume one character at a time
                    self.token = s[self.pos]
                    self.pos += 1
                field += self.token
        yield field
Usage:
>>> list(tokenizer().tokenize('foo=(3,(5+7),8),bar="hello,world",baz'))
['foo=(3,(5+7),8)', 'bar="hello,world"', 'baz']
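For the string in the question, the same idea can be sketched as a single function that keeps the open delimiters on a stack, so that braces are covered and a mismatched pair is reported instead of silently counted. This is only an illustration under a few assumptions: the names split_top_level, PAIRS and TOKEN_RE are hypothetical, fields are stripped of surrounding whitespace, and a backslash inside quotes is assumed to escape the next character.

```python
import re

# Hypothetical names; a sketch of the stack idea, not the answer's own code.
PAIRS = {')': '(', ']': '[', '}': '{'}

# One token per match: a quoted string (treated as opaque) or a single character.
TOKEN_RE = re.compile(r'''"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'|.''', re.DOTALL)

def split_top_level(s):
    """Split s on commas that are outside quotes and outside (), [] and {}."""
    fields, field, stack = [], [], []
    for tok in TOKEN_RE.findall(s):
        if tok == ',' and not stack:
            # A comma at nesting depth zero ends the current field.
            fields.append(''.join(field).strip())
            field = []
            continue
        if tok in PAIRS.values():
            stack.append(tok)
        elif tok in PAIRS:
            # A closer must match the most recently opened delimiter.
            if not stack or stack.pop() != PAIRS[tok]:
                raise ValueError('unbalanced %r' % tok)
        field.append(tok)
    if stack:
        raise ValueError('unclosed %r' % stack[-1])
    fields.append(''.join(field).strip())
    return fields

sample = ("createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, "
          "text='Test Test', source=\"Region\", entities=[foo, bar], "
          "user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}")
for part in split_top_level(sample):
    print(part)
```

Run on the sample from the question, this prints the six fields listed there, one per line. An apostrophe in unquoted text would still be read as the start of a string, so the sketch is only suitable for input shaped like the sample.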
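The tokenizer class above takes a few shortcuts:
The string literals only recognize the escape \" in double-quoted strings and \' in single-quoted strings. This is easy to fix.
It only counts the nesting depth and does not check that opening and closing delimiters match. To fix that, change depth to some sort of stack and push/pop the parens/brackets onto it.
Answer 1 (score: 1)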
Instead of splitting on the commas, you can use the following regex to match the chunks you want.
(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)
In Python:
import re
text = "createdAt=Fri Aug 24 09:48:51 EDT 2012, id=238996293417062401, text='Test Test', source=\"Region\", entities=[foo, bar], user={name=test, locations=[loc1,loc2], locations={comp1, comp2}}"
re.findall(r'(?:^| )(.+?)=(\{.+?\}|\[.+?\]|.+?)(?=,|$)', text)
>> [
('createdAt', 'Fri Aug 24 09:48:51 EDT 2012'),
('id', '238996293417062401'),
('text', "'Test Test'"),
('source', '"Region"'),
('entities', '[foo, bar]'),
('user', '{name=test, locations=[loc1,loc2], locations={comp1, comp2}}')
]
I've set up the grouping so that it separates the "key" from the "value". It will do the same in Java; see it working in Java here:
http://www.regexplanet.com/cookbook/ahJzfnJlZ2V4cGxhbmV0LWhyZHNyDgsSBlJlY2lwZRj0jzQM/index.html
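Regex explanation:
(?:^| ) is a non-capturing group that matches the start of the line or a space.
(.+?) matches the "key" before the equals sign.
= matches the equals sign.
(\{.+?\}|\[.+?\]|.+?) matches either {characters}, [characters], or, failing both, just characters.
(?=,|$) is a look-ahead that matches either a , or the end of the line.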
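The same pieces can be written out as a commented pattern. The sketch below uses re.VERBOSE and named groups; the names PAIR_RE, key and value are chosen for this illustration, and the space is written as [ ] because verbose mode ignores bare whitespace.

```python
import re

# Hypothetical name PAIR_RE; the pattern itself is the one explained above.
PAIR_RE = re.compile(r'''
    (?:^|[ ])                        # start of the line or a space
    (?P<key>.+?)                     # the key, up to the first equals sign
    =
    (?P<value>\{.+?\}|\[.+?\]|.+?)   # braces, brackets, or plain characters
    (?=,|$)                          # followed by a comma or the end of the line
''', re.VERBOSE)

text = "id=238996293417062401, text='Test Test', entities=[foo, bar]"
pairs = [(m.group('key'), m.group('value')) for m in PAIR_RE.finditer(text)]
print(pairs)
# -> [('id', '238996293417062401'), ('text', "'Test Test'"), ('entities', '[foo, bar]')]
```

Nothing in this pattern tracks quotes, so a quoted value that contains a comma, such as text='a, b', is cut at that comma.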