I have ~100k JSON files that I am looping over to build a bag-of-words model, which should be pretty simple. Each JSON file looks like this:
[{"tokens":[{"word":"Voices","lemma":"voice","pos":"NNS","ner":"O"},{"word":"from","lemma":"from","pos":"IN","ner":"O"},{"word":"Russia","lemma":"Russia","pos":"NNP","ner":"LOCATION"}],"dependencies":[{"head":0,"dep":2,"label":"prep_from"}]},{"tokens":[{"word":"Wednesday","lemma":"Wednesday","pos":"NNP","ner":"DATE"},{"word":",","lemma":",","pos":",","ner":"DATE"},{"word":"11","lemma":"11","pos":"CD","ner":"DATE"},
....
What I need is to extract only the values of the "word" key from each file and store that array in a new file, so that each file has one array:
["Voices", "from", "Wednesday","Russia", "," ,"11"...]
In addition, I have a similar array that is to be stored in ../../data/train_jsons/all_words.json
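However, json.loads creates a list for each item rather than a dict. How can I get what I want by looping over each file's list, and store these individual word arrays in new files that keep the JSON's file path name, e.g. a new file named ../../data/train_jsons/words_for_.........json?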
Trying to convert the data to a dictionary and use the key "word" did not seem to work:
import itertools
import json
import os

for subdir, dirs, files in os.walk('../../data/train_jsons'):
    for file in files:
        filepath = subdir + os.sep + file
        if filepath.endswith(".json"):
            with open(filepath) as data_file:
                data = json.load(data_file)
                # Python 2 attempt: pairs up consecutive list items as key/value
                dict = dict(itertools.izip_longest(*[iter(data)] * 2, fillvalue=""))
Speed is a key factor for my solution.
Answer 0 (score: 1)
With
d = [{'tokens': [{'lemma': 'voice', 'ner': 'O', 'word': 'Voices', 'pos': 'NNS'}, {'lemma': 'from', 'ner': 'O', 'word': 'from', 'pos': 'IN'}, {'lemma': 'Russia', 'ner': 'LOCATION', 'word': 'Russia', 'pos': 'NNP'}], 'dependencies': [{'dep': 2, 'head': 0, 'label': 'prep_from'}]}]
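this works for me (in a nested comprehension the outer loop over d has to come first, otherwise x is not yet defined):
[u['word'] for x in d for u in x['tokens']]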
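To go from a single comprehension to the per-file output the question asks for, here is a minimal sketch, assuming Python 3 and that every input file holds a list of sentence dicts shaped like the sample. The helper names extract_words and process_tree are hypothetical. The output naming (the prefix "words_for_" followed by the original file name) is also an assumption, since the question elides the exact name; all_words.json is the combined file mentioned in the question.

```python
import json
import os


def extract_words(sentences):
    """Flatten the "word" value of every token in every sentence."""
    return [token["word"] for sentence in sentences for token in sentence["tokens"]]


def process_tree(root, prefix="words_for_", combined_name="all_words.json"):
    """Write one word list per input file plus one combined list for the tree."""
    all_words = []
    for subdir, _dirs, files in os.walk(root):
        for name in sorted(files):
            # Skip non-JSON files and outputs left behind by a previous run.
            if not name.endswith(".json"):
                continue
            if name.startswith(prefix) or name == combined_name:
                continue
            with open(os.path.join(subdir, name), encoding="utf-8") as src:
                words = extract_words(json.load(src))
            # Assumed naming scheme: prefix + original file name, same directory.
            with open(os.path.join(subdir, prefix + name), "w", encoding="utf-8") as dst:
                json.dump(words, dst)
            all_words.extend(words)
    with open(os.path.join(root, combined_name), "w", encoding="utf-8") as dst:
        json.dump(all_words, dst)
    return all_words
```

Calling process_tree('../../data/train_jsons') would then cover the directory from the question. On speed: each file is parsed once and the comprehension is a single pass over its tokens, so JSON parsing and disk I/O will probably dominate. Since the files are independent, spreading them over several processes is a possible next step, though I have not measured it.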