From python

Time: 2016-05-31 16:20:41

Tags: python arrays json dictionary nlp

I have ~100k JSON files, and I am looping over them to build a bag-of-words model - fairly simple. Each JSON file looks like this:

[{"tokens":[{"word":"Voices","lemma":"voice","pos":"NNS","ner":"O"},{"word":"from","lemma":"from","pos":"IN","ner":"O"},{"word":"Russia","lemma":"Russia","pos":"NNP","ner":"LOCATION"}],"dependencies":[{"head":0,"dep":2,"label":"prep_from"}]},{"tokens":[{"word":"Wednesday","lemma":"Wednesday","pos":"NNP","ner":"DATE"},{"word":",","lemma":",","pos":",","ner":"DATE"},{"word":"11","lemma":"11","pos":"CD","ner":"DATE"},
....

What I need is to extract only the values of the "word" keys from each file and store that array in a new file, so that there is one array per file:

["Voices", "from", "Wednesday","Russia", "," ,"11"...]

In addition, there is a similar array for all of the files, in ../../data/train_jsons/all_words.json

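But json.loads creates a list for each item rather than a dict. How can I get what I want by looping over the list for each file, and store the individual word arrays in new files that keep the path name of the source JSON, e.g. a new file named ../../data/train_jsons/words_for_.........json?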

Trying to convert to a dictionary and use the key "word" does not seem to work:

for subdir, dirs, files in os.walk('../../data/train_jsons'):
    for file in files:
        filepath = subdir + os.sep + file
        if filepath.endswith(".json"):
            with open(filepath) as data_file:
                data = json.load(data_file)
                dict = dict(itertools.izip_longest(*[iter(data)] * 2, fillvalue=""))

Speed is a key factor for my solution.

1 answer:

Answer 0 (score: 1)

With

d = [{'tokens':[{'lemma':'voice','ner':'O','word':'Voices','pos':'NNS'},{'lemma':'from','ner':'O','word':'from','pos':'IN'},{'lemma':'Russia','ner':'LOCATION','word':'Russia','pos':'NNP'}],'dependencies':[{'dep':2,'head':0,'label':'prep_from'}]}]

this works for me:

[u['word'] for x in d for u in x['tokens']]
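
The same comprehension can be applied file by file. Below is a minimal sketch of that, not a tested solution for the asker's data. It assumes Python 3 and that every file holds a top-level list of sentence objects like the sample. The function names `extract_words` and `process_tree` are hypothetical, and so is naming each output as `words_for_` plus the original file name, since the question elides the exact naming scheme.

```python
import json
import os


def extract_words(sentences):
    """Flatten one parsed file (a list of sentence dicts) into its "word" values."""
    return [tok["word"] for sent in sentences for tok in sent.get("tokens", [])]


def process_tree(root, prefix="words_for_", all_name="all_words.json"):
    """Write one word list per input file plus a combined list; return the combined list.

    The prefix-based output name is an assumption made for illustration.
    """
    all_words = []
    for subdir, _dirs, files in os.walk(root):
        for name in files:
            # Skip non-JSON files and anything written by an earlier run.
            if not name.endswith(".json") or name.startswith(prefix) or name == all_name:
                continue
            with open(os.path.join(subdir, name), encoding="utf-8") as f:
                words = extract_words(json.load(f))
            with open(os.path.join(subdir, prefix + name), "w", encoding="utf-8") as f:
                json.dump(words, f)
            all_words.extend(words)
    with open(os.path.join(root, all_name), "w", encoding="utf-8") as f:
        json.dump(all_words, f)
    return all_words
```

Calling `process_tree('../../data/train_jsons')` would then write the per-file arrays next to their sources and the combined array to `all_words.json`. Each file is handled independently, so if speed matters the loop could probably be split across worker processes, though I have not measured that.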