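I have 70 GB of files, mostly TXT, CSV and logs, collected from public sources for research, training neural networks, and so on. I want to serialize every line of these files to JSON and push it into Elasticsearch so that I can work with it there. Lines may contain special characters that a JSON encoder should escape, such as Russian letters, Korean, etc. Because of Apache Lucene's file size limits, I can't simply encode a 10 GB file as a single object and push it into Elasticsearch.
Most entries look like this: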
9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:
254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:
2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8
2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93
254:username:someemail@gstuff:880896502dd68b546258\][;.'54cca34
2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr
2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\^c
2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|
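I want to take each line of a file (lines are separated by newlines) and produce something like this (with illegal JSON characters escaped):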
{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"},
{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"},
{"data":"9:username:someemail@gstuff:eafff17afbef485a894][;\'.f6d39c56b79:"}
What is the best way to approach this problem?
Answer 0 (score: 0)
import json

# Open the file; the example lines from the question were pasted into my_file.txt
with open("my_file.txt", "r") as read_my_file:
    lines = read_my_file.readlines()  # read all lines into a list

my_list = []  # new list of items, one per line
for i in lines:
    my_list.append({"data": i})  # wrap each line in a dictionary and append it
print(my_list)  # print the list to check that it is correct
my_json = json.dumps(my_list)  # convert the list to JSON
print(my_json)  # print the JSON
If you need any other details, let me know ;)
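On the escaping concern from the question, here is a minimal sketch of what `json.dumps` does with the characters mentioned there. The helper name `to_json_line` is made up for illustration; stripping the trailing line break is an assumption about what is wanted. Note that a single quote is not escaped in JSON, while a backslash is doubled, and non-ASCII text is written as `\uXXXX` escapes unless `ensure_ascii=False` is passed.

```python
import json


def to_json_line(line, ensure_ascii=True):
    """Wrap one input line in {"data": ...} and encode it as JSON.

    Hypothetical helper: the trailing line break is stripped, which is an
    assumption; drop the rstrip() call to keep it.
    """
    return json.dumps({"data": line.rstrip("\r\n")}, ensure_ascii=ensure_ascii)


# A backslash is doubled; the single quote is left alone
print(to_json_line("880896502dd68b546258\\][;.'54cca34\n"))

# Non-ASCII text: escaped by default, kept verbatim with ensure_ascii=False
print(to_json_line("Привет 한국어"))
print(to_json_line("Привет 한국어", ensure_ascii=False))
```

Both forms decode back to the same string, so the choice mostly affects output size and readability.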
Answer 1 (score: 0)
The code below does not read everything into memory. Since you are talking about 10 GB files, that may matter. I would do it like this:
#!/usr/bin/env python3
import json


def convert2json(filename):
    # Iterate over the file lazily, one line at a time
    with open(filename) as I:
        for line in I:
            d = {"data": line}  # the trailing newline is kept in the value
            print(json.dumps(d))


if __name__ == "__main__":
    import sys
    convert2json(sys.argv[1])
% python scriptname.py yourfile
{"data": "9:username:someemail@gstuff:eafff17afbef485a894][;'.f6d39c56b79:\n"}
{"data": "254:Starcius:someemail@gstuff:09160da290bcd1f83fssf0bd260e13d4f:\n"}
{"data": "2:username:someemail@gstuff:104b77708bb7c19b9f913449c923a898:8\n"}
{"data": "2:username:someemail@gstuff:efc38fca88d8e58089adccce3e05f93\n"}
{"data": "254:username:someemail@gstuff:880896502dd68b546258\\][;.'54cca34\n"}
{"data": "2:username:someemail@gstuff:647b61ba8f0965e762c579e5b3da9eca:hUr\n"}
{"data": "2:username:someemail@gstuff::3e9478fcecb4e90266art87g8fiuba90c6ed5473c:\\^c\n"}
{"data": "2:username:someemail@gstuff:9df5783228asdasddas796e18cb12e44da:,M|\n"}