I pulled a large amount of data from a Redshift cluster. The first 4 columns are separated by '|', and the next 2 columns are JSON.
XXX|ABANDONED|1197|11|"{""currency"":""EUR"" item_id"":""143"" type"":""FLIGHT"" name"":""PAR-FEZ"" price"":1111 origin"":""PAR"" destination"":""FEZ"" merchant"":""GOV"" flight_type"":""OW"" flight_segment"":[{ origin"":""ORY"" destination"":""FEZ"" departure_date_time"":""2015-08-02T07:20"" arrival_date_time"":""2015-08-02T09:05"" carrier"":""AT"" f_class"":""ECONOMY""}]}"|"{""type"":""FLIGHT"" name"":""FI_ORY-OUD"" item_id"":""FLIGHT"" currency"":""EUR"" price"":111 origin"":""ORY"" destination"":""OUD"" flight_type"":""OW"" flight_segment"":[{""origin"":""ORY"" destination"":""OUD"" departure_date_time"":""2015-08-02T13:55"" arrival_date_time"":""2015-08-02T15:30"" flight_number"":""AT625"" carrier"":""AT"" f_class"":""ECONOMIC_DISCOUNTED""}]}"
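I'm working in Python 2.7 and want to pull out the JSON values and convert them into a Pandas DataFrame, but I have little experience with pyparsing.
My approach was to read the file into a Pandas DataFrame with '|' as the separator, then take the columns containing the JSON and flatten them with 'json_normalize', but 'json_normalize' does not accept an indexed Pandas column.
I found solutions here and here, but one of them does not suit my "mixed data" and the other is too simplistic for a fairly large JSON file.
Any tips on how to apply pyparsing to this data would be very helpful. Thanks.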
Pyparsing: Parsing semi-JSON nested plaintext data to a list
Answer 0 (score: 1)
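Taking the input string above as a variable named 'data', this Python + pyparsing code will make some sense of it. Unfortunately, the stuff to the right of the fourth '|' is not really JSON. Fortunately, it is well-formatted enough that it can be parsed without undue discomfort. See the embedded comments in the following program: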
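Now apply that parser to your 'data':
    # split off the four leading '|'-delimited fields; the last piece holds the JSON-like objects
    fields = data.split('|', 4)
    # obsList is the pyparsing expression defined by the parser program
    result = obsList.parseString(fields[-1])
    # we get back a list of objects, dump them out
    for r in result:
        print r.dump()
        print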
[['currency', 'EUR'], ['item_id', '143'], ['type', 'FLIGHT'], ['name', 'PAR-FEZ'], ['price', 1111], ['origin', 'PAR'], ['destination', 'FEZ'], ['merchant', 'GOV'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]]]]
- currency: EUR
- destination: FEZ
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'FEZ'], ['departure_date_time', datetime.datetime(2015, 8, 2, 7, 20)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 9, 5)], ['carrier', 'AT'], ['f_class', 'ECONOMY']]
    - arrival_date_time: 2015-08-02 09:05:00
    - carrier: AT
    - departure_date_time: 2015-08-02 07:20:00
    - destination: FEZ
    - f_class: ECONOMY
    - origin: ORY
- flight_type: OW
- item_id: 143
- merchant: GOV
- name: PAR-FEZ
- origin: PAR
- price: 1111
- type: FLIGHT
[['type', 'FLIGHT'], ['name', 'FI_ORY-OUD'], ['item_id', 'FLIGHT'], ['currency', 'EUR'], ['price', 111], ['origin', 'ORY'], ['destination', 'OUD'], ['flight_type', 'OW'], ['flight_segment', [[['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]]]]
- currency: EUR
- destination: OUD
- flight_segment:
  [0]:
    [['origin', 'ORY'], ['destination', 'OUD'], ['departure_date_time', datetime.datetime(2015, 8, 2, 13, 55)], ['arrival_date_time', datetime.datetime(2015, 8, 2, 15, 30)], ['flight_number', 'AT625'], ['carrier', 'AT'], ['f_class', 'ECONOMIC_DISCOUNTED']]
    - arrival_date_time: 2015-08-02 15:30:00
    - carrier: AT
    - departure_date_time: 2015-08-02 13:55:00
    - destination: OUD
    - f_class: ECONOMIC_DISCOUNTED
    - flight_number: AT625
    - origin: ORY
- flight_type: OW
- item_id: FLIGHT
- name: FI_ORY-OUD
- origin: ORY
- price: 111
- type: FLIGHT
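The output above is what this gives.
Note that values that are not strings (integers, timestamps, etc.) have already been converted to Python types. Since the field names are saved as dict keys, you can access the fields by name, like this:
    # 'result' is the parsed list returned by obsList.parseString(...)
    result[0].currency
    result[0].price
    result[0].destination
    result[0].flight_segment[0].origin
    len(result[0].flight_segment)  # gives how many segments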
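If the end goal is a flat table for Pandas, here is a rough standard-library sketch of a different route. It is not the pyparsing approach above. It assumes the only damage to the JSON is that the `,"` before each key was replaced by a single space, which holds for the sample row but may not hold for the whole export. `SAMPLE` is a shortened copy of the row from the question, and the names `repair`, `parse_row` and `segment_rows` are made up for illustration.

```python
import csv
import json
import re

# Shortened copy of the sample row from the question; same damage pattern.
SAMPLE = (
    'XXX|ABANDONED|1197|11|'
    '"{""currency"":""EUR"" price"":1111 flight_segment"":'
    '[{ origin"":""ORY"" carrier"":""AT""}]}"|'
    '"{""type"":""FLIGHT"" price"":111 flight_segment"":'
    '[{""origin"":""ORY"" f_class"":""ECONOMIC_DISCOUNTED""}]}"'
)

# Assumption: a damaged key always looks like ` name":`, meaning the `,"`
# that should precede it was replaced by a single space.
MISSING_SEP = re.compile(r'\s(\w+)":')


def repair(text):
    """Put back the missing `,"` before each key, then undo the `{,` case."""
    return MISSING_SEP.sub(r',"\1":', text).replace('{,', '{')


def parse_row(line):
    """Split one '|'-delimited row into plain fields and decoded objects."""
    # csv strips the outer quotes and collapses the doubled "" inside them.
    fields = next(csv.reader([line], delimiter='|', quotechar='"'))
    return fields[:4], [json.loads(repair(blob)) for blob in fields[4:]]


def segment_rows(obj):
    """Yield one flat dict per flight segment; segment keys get a prefix."""
    top = {k: v for k, v in obj.items() if k != 'flight_segment'}
    for seg in obj.get('flight_segment', []):
        row = dict(top)
        row.update(('segment_' + k, v) for k, v in seg.items())
        yield row


plain, objects = parse_row(SAMPLE)
rows = [row for obj in objects for row in segment_rows(obj)]
print(plain)
for row in rows:
    print(sorted(row.items()))
```

`rows` is a list of flat dicts, so it can be handed to `pandas.DataFrame(rows)`. Unlike the pyparsing result, the timestamps stay as plain strings here, and any value that happens to contain a space followed by a word and `":` would break the repair step.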