Question

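I have a JSON file named region_descriptions.json, which is available at http://visualgenome.org/static/data/dataset/region_descriptions.json.zip. I suggest downloading it to understand its structure. Because the file is very large, most software cannot open it properly (in my case, Google Chrome did the trick). In this JSON file you will find many sentences stored as values of the key "phrase". I need to write all of the phrases to a .txt file, one phrase per line (only the phrases, in the SAME ORDER).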

By running the following code, I have already obtained the .txt file:

import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

f = open("text.txt","w")

for regions_dict in json_data:
    for region in regions_dict["regions"]:
        print(region["phrase"])
        f.write(region["phrase"]+"\n")

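However, I noticed that some phrases are printed more than twice in a row, and that there are blank lines between them, which seems odd. I cannot open the JSON file to check whether the resulting .txt file is correct. Does anyone have a solution?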

Answer 1

I'm not sure what you mean by "twice in a row". This solution works on the assumption that you mean repeated phrases.

import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

with open('test.txt', 'w') as f:
    all_phrases = []

    for regions_dict in json_data:
        for region in regions_dict["regions"]:
            all_phrases.append(region['phrase'])

    # keep only the non-empty phrases
    new_phrases = [phrase for phrase in all_phrases if phrase.strip()]

    # keep a phrase only if it has not appeared earlier in new_phrases
    new_phrases_again = [phrase for i, phrase in enumerate(new_phrases) if phrase not in new_phrases[:i]]

    f.write("\n".join(new_phrases_again))

Sample test.txt output:

the clock is green in colour
shade is along the street 
man is wearing sneakers
cars headlights are off
bikes are parked at the far edge
A sign on the facade of the building
A tree trunk on the sidewalk
A man in a red shirt
A brick sidewalk beside the street
The back of a white car
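
On a file this large, checking each phrase against a slice of everything before it can become slow, because the list is rescanned for every phrase. Below is a minimal sketch of the same idea (drop empty phrases, keep the first occurrence of each, preserve order) using `dict.fromkeys`. It assumes the structure shown in the question, a list of dictionaries that each hold a "regions" list; the function name `unique_phrases` and the sample data are made up for illustration.

```python
def unique_phrases(json_data):
    """Return the non-empty phrases in first-seen order, without duplicates."""
    phrases = (
        region["phrase"]
        for regions_dict in json_data
        for region in regions_dict["regions"]
    )
    # dict keys keep insertion order (Python 3.7+), so this removes
    # duplicates while preserving the original order.
    return list(dict.fromkeys(p for p in phrases if p.strip()))


# Tiny made-up sample in the same shape as region_descriptions.json.
sample = [
    {"regions": [{"phrase": "a red car"}, {"phrase": "a red car"}, {"phrase": " "}]},
    {"regions": [{"phrase": "a tree"}, {"phrase": "a red car"}]},
]
print(unique_phrases(sample))
```

As in the code above, phrases that differ only in surrounding whitespace are still treated as different phrases.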

Answer 2

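From the looks of the data, it is a list of dictionaries in which each "regions" value is a list of region dictionaries. The previous answerer beat me to the punch!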

My answer looks very similar, except that it does without the last two list comprehensions: I check for empty strings before appending.
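
This answer did not include its code, so the following is only a rough sketch of what checking for empty strings before appending might look like. It assumes the same list-of-dictionaries structure; the function name `collect_phrases` and the sample data are hypothetical. Note that this version only skips blank phrases and leaves repeated phrases in place.

```python
def collect_phrases(json_data):
    """Collect phrases in file order, skipping empty or whitespace-only ones."""
    phrases = []
    for regions_dict in json_data:
        for region in regions_dict["regions"]:
            phrase = region["phrase"]
            # Check for an empty string before appending.
            if phrase.strip():
                phrases.append(phrase)
    return phrases


# Made-up sample data for illustration only.
sample = [
    {"regions": [{"phrase": "man is wearing sneakers"}, {"phrase": ""}]},
    {"regions": [{"phrase": "man is wearing sneakers"}]},
]
print(collect_phrases(sample))
```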

Answer 3

import json 

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file) 

for regions_dict in json_data: 
    for region in regions_dict["regions"]: 
        print(region["phrase"])

This should do the trick. You just need to reference the key you want and understand the structure of the data.

Doing something like this might help:

import sys
import json
import pprint

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

for regions_dict in json_data:
    pprint.pprint(regions_dict["regions"])
    sys.exit()

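You will get nicely formatted output that lets you "see" what the structure looks like. A quick online course on lists and dictionaries might help you understand how these objects hold data. Basically, [ ] is a list of data and { } is a dictionary (key-value pairs). This is where I started: https://www.codecademy.com/learn/learn-python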
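
To make the list/dictionary distinction concrete, here is a small sketch that walks a made-up two-entry sample. Only the "regions" and "phrase" keys come from the question; the "id" key and the sample values are assumptions for illustration.

```python
import json

# Hypothetical two-entry sample in the shape the question describes.
sample_text = """
[
  {"id": 1, "regions": [{"phrase": "the clock is green in colour"},
                        {"phrase": "man is wearing sneakers"}]},
  {"id": 2, "regions": [{"phrase": "A tree trunk on the sidewalk"}]}
]
"""
data = json.loads(sample_text)

print(type(data).__name__)      # the outer [ ] becomes a list
print(type(data[0]).__name__)   # each { } becomes a dict
print(sorted(data[0].keys()))   # the keys of the first dictionary

# Index into the list, then look up keys in the dictionaries.
first_phrase = data[0]["regions"][0]["phrase"]
print(first_phrase)
```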

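The code should work fine. If there are repeated phrases, it is because the .json itself contains repeated phrases, and the blank lines mean that some phrases are empty. If you want a list of unique phrases, you can build on your existing code: add each phrase to a list only if it is not already in that list. Like this: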

import json

with open("region_descriptions.json", 'r') as file:
    json_data = json.load(file)

phrase_list = []

for regions_dict in json_data:
    for region in regions_dict["regions"]:
        if region["phrase"] not in phrase_list:
            phrase_list.append(region["phrase"])

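I would also suggest that in the future you work with a small portion of the data rather than the whole large file. It makes it much easier to figure out what to do! Good luck!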
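
One possible way to get such a small portion is to copy the first few top-level entries into a separate file that any editor can open. This is a sketch only: the function name `write_sample`, the output filename, and the default of 5 entries are all assumptions, and the full file still has to be loaded once to produce the sample.

```python
import json


def write_sample(src_path, dst_path, n=5):
    """Copy the first n top-level entries of a JSON list into a smaller file."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    subset = data[:n]
    with open(dst_path, "w", encoding="utf-8") as dst:
        # indent=2 makes the sample easy to read in a text editor.
        json.dump(subset, dst, indent=2)
    return len(subset)
```

For example, `write_sample("region_descriptions.json", "sample.json")` would write the first five entries to sample.json, which can then be inspected by hand and compared with the generated .txt file.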
