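I am trying to parse a fixed-width flat file (header and detail record types) that has no repeating or defined tag value to identify the segments. When I try to process the file in Anypoint Studio (a simple transformation to JSON), I get the error message "java.lang.IllegalStateException: Segment not defined". I understand the schema needs to be fixed, but I have run out of ideas to try.
I would appreciate it if anyone could point out what is wrong with it from an Anypoint Studio point of view.
Schema: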
form: FIXEDWIDTH
structures:
- id: 'flatfile'
  name: flatfile
  tagStart: 0
  data:
  - { idRef: 'Header', count: 1}
  - { idRef: 'Items', count: 99, usage: O}
segments:
- id: 'Header'
  name: Header
  values:
  - { name: 'PCBCode', type: String, length: 8 }
  - { name: 'NumberTG', type: String, length: 17 }
  - { name: 'TopSort', type: String, length: 1 }
  - { name: 'InternalRef', type: String, length: 5 }
  - { name: 'DateInt', type: String, length: 26 }
  - { name: 'DAT', type: String, length: 26 }
  - { name: 'DIN', type: String, length: 26 }
  - { name: 'DLN', type: String, length: 26 }
  - { name: 'DON', type: String, length: 26 }
  - { name: 'Sort', type: String, length: 10 }
  - { name: 'NameCharter', type: String, length: 35 }
  - { name: 'NumberReg', type: String, length: 17 }
  - { name: 'NatTruck', type: String, length: 3 }
  - { name: 'NumRemarks', type: String, length: 17 }
  - { name: 'NatRemarks', type: String, length: 3 }
  - { name: 'Weight', type: String, length: 6 }
  - { name: 'Remarks', type: String, length: 35 }
- id: 'Items'
  name: Items
  values:
  - { name: 'TVNum', type: String, length: 17 }
  - { name: 'Load', type: String, length: 1 }
  - { name: 'Flag', type: String, length: 1 }
  - { name: 'col', type: String, length: 17 }
The sample data below has a length of 4000:
BCD_VAN 180223G04467 N377612018-02-23-13.57.15.7722282018-02-26-13.21.26.3305841901-01-01-00.00.00.0000001901-01-01-00.00.00.0000001901-01-01-00.00.00.000000 TAURUS W1TRS19 PL WWL72142 PL 000000 G18GKJ99-690851 G18GKJ96-690851 G18GKJ22-685131 G18GKJ00-668701 G18GGX99-668701
Answer 0 (score: 0)
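Fixed-width data is easy to process thanks to the magic of Python slices. A slice is an object that can be used to "slice" pieces out of a sequence, whether that is a string, a list, a tuple, or any other sequence that supports indexing. Imagine you have the string rec = "BLAHXXIMPORTANT DATAXXBLAH". You can extract the important data with rec[6:20]. You can also create a slice object with data_slice = slice(6, 20) and then get the value from rec with rec[data_slice].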
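As a quick check of that equivalence, here is a minimal sketch showing that indexing with a slice object returns the same text as the literal slice syntax. The names `direct` and `via_object` are only illustrative.

```python
rec = "BLAHXXIMPORTANT DATAXXBLAH"

# literal slice syntax
direct = rec[6:20]

# the same range expressed as a reusable slice object
data_slice = slice(6, 20)
via_object = rec[data_slice]

print(direct)                # IMPORTANT DATA
print(direct == via_object)  # True

# slice objects expose their bounds, which helps when debugging a layout
print(data_slice.start, data_slice.stop)  # 6 20
```

Because a slice object is an ordinary value, it can be stored in a list next to a field name and reused for every record.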
Here is an extractor for your sample data record. It builds the slices and their associated names by parsing the field specification:
layout = """\
- { name: 'PCBCode', type: String, length: 8 }
- { name: 'NumberTG', type: String, length: 17 }
- { name: 'TopSort', type: String, length: 1 }
- { name: 'InternalRef', type: String, length: 5 }
- { name: 'DateInt', type: String, length: 26 }
- { name: 'DAT', type: String, length: 26 }
- { name: 'DIN', type: String, length: 26 }
- { name: 'DLN', type: String, length: 26 }
- { name: 'DON', type: String, length: 26 }
- { name: 'Sort', type: String, length: 10 }
- { name: 'NameCharter', type: String, length: 35 }
- { name: 'NumberReg', type: String, length: 17 }
- { name: 'NatTruck', type: String, length: 3 }
- { name: 'NumRemarks', type: String, length: 17 }
- { name: 'NatRemarks', type: String, length: 3 }
- { name: 'Weight', type: String, length: 6 }
- { name: 'Remarks', type: String, length: 35 }
"""
# build data slicer - list of names and slices for each field in the fixed format input
slicer = []
cur = 0
for line in layout.splitlines():
    # split the line on whitespace, will give a list like:
    # ['-', '{', 'name:', "'PCBCode',", 'type:', 'String,', 'length:', '8', '}']
    # the name is in element 3 (we start with 0), and the integer length
    # is second from last, so we can use index -2 to get it
    parts = line.split()
    if not parts:
        continue
    slice_name = parts[3].strip("',")
    slice_len = int(parts[-2])
    slicer.append((slice_name, slice(cur, cur+slice_len)))
    cur += slice_len
# print out the names and slices
for slc in slicer:
    print(slc)
print()
Prints:
('PCBCode', slice(0, 8, None))
('NumberTG', slice(8, 25, None))
('TopSort', slice(25, 26, None))
('InternalRef', slice(26, 31, None))
('DateInt', slice(31, 57, None))
('DAT', slice(57, 83, None))
('DIN', slice(83, 109, None))
('DLN', slice(109, 135, None))
('DON', slice(135, 161, None))
('Sort', slice(161, 171, None))
('NameCharter', slice(171, 206, None))
('NumberReg', slice(206, 223, None))
('NatTruck', slice(223, 226, None))
('NumRemarks', slice(226, 243, None))
('NatRemarks', slice(243, 246, None))
('Weight', slice(246, 252, None))
('Remarks', slice(252, 287, None))
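The offsets in this listing are simply running totals of the field lengths. As an alternative sketch (not part of the parsing approach above), the same slices can be built from plain (name, length) pairs with `itertools.accumulate`. The names `fields` and `field_slices` are made up for this illustration, and only the first four Header fields are included to keep it short.

```python
from itertools import accumulate

# (name, length) pairs for the first four Header fields, copied from the schema
fields = [("PCBCode", 8), ("NumberTG", 17), ("TopSort", 1), ("InternalRef", 5)]

lengths = [length for _, length in fields]
# end offsets are the running totals; each start is the previous end
ends = list(accumulate(lengths))
starts = [0] + ends[:-1]

field_slices = [(name, slice(start, end))
                for (name, _), start, end in zip(fields, starts, ends)]

for entry in field_slices:
    print(entry)
```

For these four fields the result should agree with the first four entries printed above.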
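Now you can use the slices (which are just like the little (start, end, step) triples you use to index into a string as data[start:end:step]) and their associated names to build a dictionary.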
# a simple method to slice up a fixed format data line with a slicer, strips trailing spaces from fields
def extract(slicer, data_line):
    return {name: data_line[data_slice].strip() for name, data_slice in slicer}
Here is how it looks with your data:
# try it out
data = "BCD_VAN 180223G04467 N377612018-02-23-13.57.15.7722282018-02-26-13.21.26.3305841901-01-01-00.00.00.0000001901-01-01-00.00.00.0000001901-01-01-00.00.00.000000 TAURUS W1TRS19 PL WWL72142 PL 000000 G18GKJ99-690851 G18GKJ96-690851 G18GKJ22-685131 G18GKJ00-668701 G18GGX99-668701 "
data_dict = extract(slicer, data)
# output as JSON
import json
print(json.dumps(data_dict, indent=2))
Prints:
{
"Remarks": "",
"PCBCode": "BCD_VAN",
"NatTruck": "PL",
"DateInt": "2018-02-23-13.57.15.772228",
"DAT": "2018-02-26-13.21.26.330584",
"NumRemarks": "WWL72142",
"Weight": "000000",
"DIN": "1901-01-01-00.00.00.000000",
"Sort": "",
"InternalRef": "37761",
"NumberTG": "180223G04467",
"DLN": "1901-01-01-00.00.00.000000",
"TopSort": "N",
"NatRemarks": "PL",
"NumberReg": "W1TRS19",
"NameCharter": "TAURUS",
"DON": "1901-01-01-00.00.00.000000"
}
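The extractor above only covers the Header fields. The same idea could be extended to the repeating Items records, as in the rough sketch below. It rests on assumptions the question does not confirm: that the Items records start immediately after the 287-character Header on the same line, that each one is exactly 36 characters wide (17 + 1 + 1 + 17, from the schema), and that trailing blank space means there are no more items. The helper names `build_slicer` and `extract_items` and the test line are hypothetical.

```python
# Items field lengths are copied from the schema in the question.
ITEM_FIELDS = [("TVNum", 17), ("Load", 1), ("Flag", 1), ("col", 17)]
HEADER_WIDTH = 287  # sum of the Header field lengths (assumed start of the items)


def build_slicer(fields):
    """Turn (name, length) pairs into (name, slice) pairs plus the total width."""
    slicer, cur = [], 0
    for name, length in fields:
        slicer.append((name, slice(cur, cur + length)))
        cur += length
    return slicer, cur


def extract(slicer, data_line):
    """Slice one record into a dict, stripping padding from each field."""
    return {name: data_line[s].strip() for name, s in slicer}


def extract_items(data_line, header_width=HEADER_WIDTH, max_items=99):
    """Read fixed-width item records that follow the header (assumed layout)."""
    item_slicer, item_width = build_slicer(ITEM_FIELDS)
    items = []
    for start in range(header_width, len(data_line), item_width):
        chunk = data_line[start:start + item_width]
        # stop at the first all-blank record or once the schema's limit is hit
        if not chunk.strip() or len(items) >= max_items:
            break
        items.append(extract(item_slicer, chunk))
    return items


# made-up line: a blank header followed by two invented item records and padding
line = (" " * HEADER_WIDTH
        + "ITEM-0001".ljust(17) + "Y" + "N" + "ABC".ljust(17)
        + "ITEM-0002".ljust(17) + "N" + "Y" + "".ljust(17)
        + " " * 72)
items = extract_items(line)
print(items)
```

If the real file lays out its detail records differently, only `HEADER_WIDTH` and `ITEM_FIELDS` should need to change.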