How can I parse the following input (line by line, with a regular expression, or a combination of both):
Table[
Row[
C_ID[Data:12345.0][Sec:12345.0][Type:Double]
F_ID[Data:17660][Sec:17660][Type:Long]
NAME[Data:Mike Jones][Sec:Mike Jones][Type:String]
]
Row[
C_ID[Data:2560.0][Sec:2560.0][Type:Double]
...
]
]
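There is indentation too, of course, so the text could be split on \n\t (and then the extra \t tabs on the C_ID, F_ID lines cleaned up, and so on).
The desired output is something more usable in Python: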
{'C_ID': 12345, 'F_ID': 17660, 'NAME': 'Mike Jones',....} {'C_ID': 2560, ....}
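I have tried going line by line and using multiple split() calls to throw away what I don't need and keep what I do, but I'm sure there is a more elegant and faster way to do this.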
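Answer 0 (score: 3)
Answer 1 (score: 1)
There really isn't much unpredictable nesting here, so you could do this with regular expressions. But pyparsing is my tool of choice, so here is my solution: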
from pyparsing import *
LBRACK,RBRACK,COLON = map(Suppress,"[]:")
ident = Word(alphas, alphanums+"_")
datatype = oneOf("Double Long String Boolean")
# define expressions for pieces of attribute definitions
data = LBRACK + "Data" + COLON + SkipTo(RBRACK)("contents") + RBRACK
sec = LBRACK + "Sec" + COLON + SkipTo(RBRACK)("contents") + RBRACK
type = LBRACK + "Type" + COLON + datatype("datatype") + RBRACK
# define entire attribute definition, giving each piece its own results name
attrDef = Group(ident("key") + data("data") + sec("sec") + type("type"))
# now a row is just a "Row[" and one or more attrDef's and "]"
rowDef = Group("Row" + LBRACK + Group(OneOrMore(attrDef))("attrs") + RBRACK)
# this method will process each row, and convert the key and data fields
# to addressable results names
def assignAttrs(tokens):
    ret = ParseResults(tokens.asList())
    for attr in tokens[0].attrs:
        # use datatype mapped to function to convert data at parse time
        # (caveat: bool() of any non-empty string, even 'False', is True)
        value = {
            'Double' : float,
            'Long' : int,
            'String' : str,
            'Boolean' : bool,
            }[attr.type.datatype](attr.data.contents)
        ret[attr.key] = value
    # replace parse results created by pyparsing with our own named results
    tokens[0] = ret
rowDef.setParseAction(assignAttrs)
# a TABLE is just "Table[", one or more rows and "]"
tableDef = "Table" + LBRACK + OneOrMore(rowDef)("rows") + RBRACK
test = """
Table[
Row[
C_ID[Data:12345.0][Sec:12345.0][Type:Double]
F_ID[Data:17660][Sec:17660][Type:Long]
NAME[Data:Mike Jones][Sec:Mike Jones][Type:String]
]
Row[
C_ID[Data:2560.0][Sec:2560.0][Type:Double]
NAME[Data:Casey Jones][Sec:Mike Jones][Type:String]
]
]"""
# now parse table, and access each row and its defined attributes
results = tableDef.parseString(test)
for row in results.rows:
    print(row.dump())
    print(row.NAME, row.C_ID)
    print()
Prints:
[[[['C_ID', 'Data', '12345.0', 'Sec', '12345.0', 'Type', 'Double'],...
- C_ID: 12345.0
- F_ID: 17660
- NAME: Mike Jones
Mike Jones 12345.0
[[[['C_ID', 'Data', '2560.0', 'Sec', '2560.0', 'Type', 'Double'], ...
- C_ID: 2560.0
- NAME: Casey Jones
Casey Jones 2560.0
The results names assigned in assignAttrs give you access to each attribute by name. To check whether a name was omitted from a row, just test "if not row.F_ID:".
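Answer 2 (score: 0)
This excellent page lists many of the parsers available to Python programmers. Regular expressions are not well suited to matching "balanced brackets", but any of the third-party packages reviewed on that page will serve you well.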
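To illustrate the balanced-brackets point: checking nesting needs a running depth count, which is state that a classic regular expression does not keep. The following is only a minimal sketch in plain Python; the function name is_balanced and its parameters are made up for this example and do not come from any of the packages mentioned above.

```python
def is_balanced(text, open_ch="[", close_ch="]"):
    """Return True if every open_ch in text has a matching close_ch."""
    depth = 0
    for ch in text:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                # a closing bracket appeared before its opening bracket
                return False
    # anything left open means the input was truncated
    return depth == 0
```

A check like this could be run on the input before parsing, to reject a truncated table early.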
Answer 3 (score: -1)
This regular expression:
Row\[[\s]*C_ID\[[\W]*Data:([0-9.]*)[\S\W]*F_ID\[[\S\W]*Data:([0-9.]*)[\S\W]*NAME\[[\S\W]*Data:([\w ]*)[\S ]*
applied to the first row, will match:
$1 = 12345.0 $2 = 17660 $3 = Mike Jones
Then you can use something like this:
{'C_ID': $1, 'F_ID': $2, 'NAME': '$3'}
to produce:
{'C_ID': 12345.0, 'F_ID': 17660, 'NAME': 'Mike Jones'}
So you need to iterate over your input until it stops matching your rows... Does that make sense?
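The iteration described here could be sketched in Python roughly as follows. This is not the regular expression from this answer: it swaps in two smaller patterns, one that finds each Row[...] block and one that picks out each attribute inside it, so rows with missing attributes still work. The names ROW_RE, ATTR_RE, CONVERTERS and rows_from_text are invented for the example, and the Boolean converter assumes the values are written as "true"/"false", which the question does not confirm.

```python
import re

# One Row[...] block: "Row[", then any number of NAME[...][...][...] attributes, then "]"
ROW_RE = re.compile(r"Row\[((?:\s*\w+(?:\[[^\]]*\])+)*)\s*\]")

# One attribute: captures the key, the Data value and the Type name
ATTR_RE = re.compile(r"(\w+)\[Data:([^\]]*)\]\[Sec:[^\]]*\]\[Type:(\w+)\]")

# Converters keyed by the Type field; the Boolean text format is an assumption
CONVERTERS = {
    "Double": float,
    "Long": int,
    "String": str,
    "Boolean": lambda s: s.strip().lower() == "true",
}

def rows_from_text(text):
    """Return one dict per Row[...] block, keyed by attribute name."""
    rows = []
    for row_match in ROW_RE.finditer(text):
        row = {}
        for key, data, type_name in ATTR_RE.findall(row_match.group(1)):
            row[key] = CONVERTERS[type_name](data)
        rows.append(row)
    return rows
```

Calling rows_from_text on the table text returns a list of dicts, one per row. An unknown Type name raises KeyError, which seems preferable to silently guessing a conversion.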