Serializing data to JSON with Python

Time: 2018-06-21 08:51:39

Tags: python json elasticsearch

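I have 70 GB of files, mostly TXT, CSV, and logs, collected from public sources for research, training neural networks, and so on. I want to serialize each line of these files to JSON and push it to Elasticsearch so I can work with it. Lines may contain special characters that the JSON encoder should escape, such as Russian letters, Korean, etc. Because of Apache Lucene's file size limits, I can't simply encode a 10 GB file as a single object and push it into Elasticsearch.

Most entries look like this: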

9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:
254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:
2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8
2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93
254:username:someemail@gstuff:880896502dd68b546258\][;.'54cca34
2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr
2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\^c
2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|

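I want to take each line of the file (lines are separated by newlines) and produce something like this (with illegal JSON characters escaped):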

{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"},
{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"},
{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"}

What is the best way to solve this?

2 answers:

Answer 0: (score: 0)

import json

# Open the file; the example lines from the question were pasted into my_file.txt
with open("my_file.txt", "r") as read_my_file:
    lines = read_my_file.readlines()  # read every line into a list (loads the whole file into memory)

my_list = []  # list that will hold one dictionary per line

for line in lines:  # loop over every line
    # wrap the line in a dictionary, dropping the trailing newline, and append it
    my_list.append({"data": line.rstrip("\n")})

print(my_list)  # print the list to check that it is correct

my_json = json.dumps(my_list)  # convert the list to a JSON string
print(my_json)  # print the JSON

Let me know if you need any further details ;)
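On the escaping concern raised in the question, here is a minimal sketch of what `json.dumps` does with awkward characters. The helper name `to_json_line` and the sample string are made up for illustration. By default `json.dumps` escapes double quotes, backslashes, and control characters, and turns non-ASCII text into `\uXXXX` sequences; passing `ensure_ascii=False` keeps non-ASCII characters readable instead. A single quote needs no escaping in JSON, so it is left alone.

```python
import json


def to_json_line(raw_line, ensure_ascii=True):
    """Wrap one raw text line in {"data": ...} and return it as a JSON string.

    Hypothetical helper for illustration; the trailing newline is stripped
    so it does not end up inside the "data" value.
    """
    return json.dumps({"data": raw_line.rstrip("\r\n")}, ensure_ascii=ensure_ascii)


# A made-up line mixing Cyrillic text, a backslash, and a double quote.
sample = 'пароль:\\^c"x\n'

escaped = to_json_line(sample)                       # pure-ASCII output
readable = to_json_line(sample, ensure_ascii=False)  # keeps Cyrillic as-is

print(escaped)

# Both forms decode back to the same original text.
assert json.loads(escaped) == json.loads(readable)
```

Either form is valid JSON and decodes to the same string, so the choice mostly affects file size and readability.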

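Answer 1: (score: 0)

The code below does not read everything into memory. Since you are talking about 10 GB files, that may be important. I would do it like this: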

#!/usr/bin/env python3

import json


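def convert2json(filename):
    """Print one JSON object per input line, reading the file lazily."""
    with open(filename) as infile:
        # Iterating over the file object yields one line at a time,
        # so the whole file is never held in memory.
        for line in infile:
            d = {"data": line}
            print(json.dumps(d))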

if __name__ == "__main__":
    import sys

    convert2json(sys.argv[1])

% python scriptname.py yourfile
{"data": "9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:\n"}
{"data": "254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:\n"}
{"data": "2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8\n"}
{"data": "2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93\n"}
{"data": "254:username:someemail@gstuff:880896502dd68b546258\\][;.'54cca34\n"}
{"data": "2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr\n"}
{"data": "2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\\^c\n"}
{"data": "2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|\n"}