I need to process a large JSON file to extract information from it. Here is what my file looks like:
[
{
"diagnoses": [
{
"classification_of_tumor": "not reported",
"last_known_disease_status": "not reported",
"updated_datetime": "2016-05-16T11:00:32.695517-05:00",
"primary_diagnosis": "c50.9",
"submitter_id": "TCGA-AN-A0FD_diagnosis",
"tumor_stage": "stage iia",
"age_at_diagnosis": 26007.0,
"vital_status": "alive",
"morphology": "8500/3",
"days_to_death": null,
"days_to_last_known_disease_status": null,
"days_to_last_follow_up": 196.0,
"state": null,
"days_to_recurrence": null,
"diagnosis_id": "9b0c5d28-5bd6-536f-8cfb-1e96044bce38",
"tumor_grade": "not reported",
"tissue_or_organ_of_origin": "c50.9",
"days_to_birth": -26007.0,
"progression_or_recurrence": "not reported",
"prior_malignancy": "not reported",
"site_of_resection_or_biopsy": "c50.9",
"created_datetime": null
}
],
"case_id": "c6086936-7544-4da0-8c0c-114166848483",
"demographic": {
"updated_datetime": "2016-05-16T11:00:32.695517-05:00",
"created_datetime": null,
"gender": "female",
"state": null,
"submitter_id": "TCGA-AN-A0FD_demographic",
"year_of_birth": 1939,
"race": "white",
"demographic_id": "423c153c-77d7-5e97-ae64-11442d5ba4f8",
"ethnicity": "not hispanic or latino",
"year_of_death": null
},
"exposures": [
{
"cigarettes_per_day": null,
"weight": null,
"updated_datetime": "2016-05-16T11:00:32.695517-05:00",
"alcohol_history": null,
"alcohol_intensity": null,
"bmi": null,
"years_smoked": null,
"height": null,
"created_datetime": null,
"state": null,
"exposure_id": "0abf6770-e176-523e-a94e-66d779c58e69",
"submitter_id": "TCGA-AN-A0FD_exposure"
}
]
},
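The output I'm interested in is a text file with two columns, where the first column is tumor_stage and the second column is case_id. For this example, it would look like: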
stage iia c6086936-7544-4da0-8c0c-114166848483
Answer 0 (score: 1)
Here is a Python program that performs the conversion you asked for. Copy this code into a file named "convert.py". You can then run the program like this:
python convert.py my_existing_file.json my_new_file.txt
Here is the program:
import argparse
import json
# Get filenames from user
parser = argparse.ArgumentParser()
parser.add_argument(
    'input', type=argparse.FileType('r'), help="input JSON filename")
parser.add_argument(
    'output', type=argparse.FileType('w'), help="output 2-col text filename")
args = parser.parse_args()
# Read data in
data = json.load(args.input)
# Convert data to abstracted format
data = [[d["diagnoses"][0]["tumor_stage"], d["case_id"]] for d in data]
# Write the data out:
for d in data:
    args.output.write("{}\t{}\n".format(*d))
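One caveat with the program above: it indexes d["diagnoses"][0] directly, so a case whose "diagnoses" list is empty or missing would raise an exception, and any diagnosis after the first is ignored. What follows is a minimal sketch of a more defensive variant, not part of the original answer. It assumes that you want one row per diagnosis and that missing values should be written as a placeholder. The names extract_rows and convert and the "NA" placeholder are hypothetical choices for illustration.

```python
import json


def extract_rows(cases, missing="NA"):
    """Yield (tumor_stage, case_id) pairs, one per diagnosis.

    `missing` is a placeholder (an assumption, not from the original
    answer) used when a case has no diagnoses or a value is absent/null.
    """
    for case in cases:
        case_id = case.get("case_id") or missing
        # Treat a missing or empty "diagnoses" list as one empty diagnosis
        # so the case still produces a row.
        diagnoses = case.get("diagnoses") or [{}]
        for diagnosis in diagnoses:
            stage = diagnosis.get("tumor_stage") or missing
            yield (stage, case_id)


def convert(input_path, output_path):
    """Read the JSON array at input_path and write tab-separated rows."""
    with open(input_path) as infile:
        cases = json.load(infile)
    with open(output_path, "w") as outfile:
        for stage, case_id in extract_rows(cases):
            outfile.write("{}\t{}\n".format(stage, case_id))
```

Calling convert("my_existing_file.json", "my_new_file.txt") would produce the same two-column layout as the program above. Like that program, this sketch uses json.load, so the whole file is still read into memory at once.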