Digging into a JSON file

Time: 2017-11-16 00:07:41

Tags: json python-3.x pandas

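I have been trying many approaches (and many Stack Overflow questions) to normalize a deeply nested JSON file. I tried .apply(pd.Series), but it does not work well on dictionaries with many levels.

I am currently trying json_normalize, and it has given me some results. I think I understand how the function works; my problem is that I do not know how to navigate the dictionary.

So far I have been able to get two levels down.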

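import json
import pandas as pd
from pandas.io.json import json_normalize  # pandas >= 1.0: use pd.json_normalize

# Load the file, then flatten the list of hits (levels 1 and 2)
with open('authors.json') as f:
    raw = json.load(f)
raw2 = json_normalize(raw['hits']['hits'])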

That gives me what I need (at least for the first level), but I do not know how to go deeper.

I tried:

raw2 = json_normalize(raw['hits']['hits'][0])
raw2 = json_normalize(raw['hits']['hits']['_source.authors'])
TypeError: string indices must be integers

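And more, but randomly trying things I do not understand is not the right approach. I think my questions are:

  • How do I know what kind of container the next level of the JSON is ({} vs [])?
  • Is there a visual way to represent this?

It is odd that this topic is not well covered online. I work with JSON data more and more every day.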

_id _index  _score  _source.authors _source.deleted _source.description _source.doi _source.is_valid    _source.issue   _source.journal ... _source.rating_versatility_weighted _source.review_count    _source.tag _source.title   _source.userAvg _source.user_id _source.venue_name  _source.views_count _source.volume  _type   
0   7CB3F2AD    scibase_listings    1   None    0   None        1   None    Physical Review Letters ... 0   0   [mass spectra, elementary particles, bound sta...   Evidence for a new meson: A quasinuclear NN-ba...   0   None    Physical Review Letters 0   None    listing
1   7AF8EBC3    scibase_listings    1   [{'affiliations': ['Punjabi University'], 'aut...   0   None        1   None    Journal of Industrial Microbiology & Biotechno...   ... 0   0   [flow rate, operant conditioning, packed bed r...   Development of a stable continuous flow immobi...   0   None    Journal of Industrial Microbiology & Biotechno...   0   None    listing
2   7521A721    scibase_listings    1   [{'author_id': '7FF872BC', 'author_name': 'bar...   0   None        1   None    The American Historical Review  ... 0   0   [social movements]  Feminism and the women's movement : dynamics o...   0   None    The American Historical Review  0   None    listing

Here is part of the file (this is level 3; levels 1 and 2 are 'hits' and 'hits').

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '7CB3F2AD',
    '_index': 'scibase_listings',
            "_type": "listing",
            "_id": "7FDFEB02",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [
                    {
                        "author_id": "166468F4",
                        "author_name": "a bowdoin van riper"
                    },
                    {
                        "author_id": "81070854",
                        "author_name": "jeffrey h schwartz"
                    }
                ],
                "user_id": null,
                "pub_date": "1994-01-01 00:00:00",
                "pages": null,
                "doi": "",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Men Among the Mammoths: Victorian Science and the Discovery of Human Prehistory",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        },
        {
            "_index": "scibase_listings",
            "_type": "listing",
            "_id": "7538108B",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [
                    {
                        "affiliations": [
                            "Pennsylvania State University"
                        ],
                        "author_id": "7E15BDFA",
                        "author_name": "roger l geiger"
                    }
                ],
                "user_id": null,
                "pub_date": "2013-06-01 00:00:00",
                "pages": null,
                "doi": "10.1093/ahr/118.3.896a",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Elizabeth Popp Berman. Creating the Market University: How Academic Science Became an Economic Engine.",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        }
    ]

2 answers:

Answer 0 (score: 0)

I think I figured out how to 'dig' through the JSON. It depends on whether the next level is a list or a dictionary.
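One way to see which kind of container sits at each level is to walk the structure and report the type along the way. The following is only a minimal sketch: `describe` is a hypothetical helper (not part of pandas or the json module), the sample is cut down from the file shown in the question, and it assumes every list is homogeneous, so only the first element is inspected.

```python
def describe(node, path="raw", depth=0, max_depth=10):
    """Return one line per dict or list found while walking the structure."""
    lines = []
    if isinstance(node, dict):
        lines.append(f"{path}: dict with keys {sorted(node)}")
        if depth < max_depth:
            for key, value in node.items():
                lines.extend(describe(value, f"{path}[{key!r}]", depth + 1, max_depth))
    elif isinstance(node, list):
        lines.append(f"{path}: list of {len(node)} items")
        # Assumption: list items share one shape, so look at the first one only.
        if node and depth < max_depth:
            lines.extend(describe(node[0], f"{path}[0]", depth + 1, max_depth))
    # Scalars (str, int, None, ...) produce no line.
    return lines


# Cut-down sample with the same nesting as the file in the question.
sample = {
    "hits": {
        "hits": [
            {
                "_id": "7538108B",
                "_source": {
                    "authors": [
                        {
                            "author_id": "7E15BDFA",
                            "author_name": "roger l geiger",
                            "affiliations": ["Pennsylvania State University"],
                        }
                    ]
                },
            }
        ]
    }
}

lines = describe(sample)
for line in lines:
    print(line)
```

Each printed path can be pasted back into the interpreter: a "dict" line means the next index is a key (a string), and a "list" line means the next index is an integer.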

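In my case I was able to dig all the way down, as shown below. I still need to figure out how to handle the full lists (probably with a loop) so that I get all the values, not just those at [0] and [1].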

raw['hits']['hits'][1]['_source']['authors'][0]['affiliations']
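The loop mentioned above could look roughly like the following. This is a sketch under assumptions: `collect_authors` is a hypothetical helper name, the sample mirrors only a few fields of the file in the question, and it assumes that "authors" may be null (as in the first row of the DataFrame output) and that "affiliations" is present only for some authors.

```python
def collect_authors(raw):
    """Return one flat dict per (hit, author) pair."""
    rows = []
    for hit in raw["hits"]["hits"]:
        source = hit.get("_source") or {}
        # "authors" can be missing or null, so fall back to an empty list.
        for author in source.get("authors") or []:
            rows.append({
                "_id": hit.get("_id"),
                "author_id": author.get("author_id"),
                "author_name": author.get("author_name"),
                # "affiliations" only exists for some authors.
                "affiliations": author.get("affiliations") or [],
            })
    return rows


# Cut-down sample based on the records shown in the question.
sample = {
    "hits": {
        "hits": [
            {"_id": "7CB3F2AD", "_source": {"authors": None}},
            {
                "_id": "7FDFEB02",
                "_source": {
                    "authors": [
                        {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
                        {"author_id": "81070854", "author_name": "jeffrey h schwartz"},
                    ]
                },
            },
            {
                "_id": "7538108B",
                "_source": {
                    "authors": [
                        {
                            "affiliations": ["Pennsylvania State University"],
                            "author_id": "7E15BDFA",
                            "author_name": "roger l geiger",
                        }
                    ]
                },
            },
        ]
    }
}

rows = collect_authors(sample)
for row in rows:
    print(row)
```

The resulting list of flat dicts can be passed straight to pd.DataFrame(rows) if a table is needed.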

Answer 1 (score: 0)

You can try this:

json_normalize(raw['hits'],'hits','_source','authors','affiliations')
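Note that the positional arguments after the data are, in order, record_path, meta, meta_prefix and record_prefix, so it may be clearer to pass them by keyword. A hedged sketch of that variant follows. It assumes pandas 1.0 or later (where the function is available as pd.json_normalize), uses a cut-down sample of the records from the question, and filters out hits whose "authors" is null first, since record_path expects a list there. The choice of meta columns is only an illustration.

```python
import pandas as pd

# Cut-down sample: the list found at raw['hits']['hits'] in the question.
hits = [
    {"_id": "7CB3F2AD", "_source": {"title": "Evidence for a new meson", "authors": None}},
    {
        "_id": "7FDFEB02",
        "_source": {
            "title": "Men Among the Mammoths",
            "authors": [
                {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
                {"author_id": "81070854", "author_name": "jeffrey h schwartz"},
            ],
        },
    },
    {
        "_id": "7538108B",
        "_source": {
            "title": "Creating the Market University",
            "authors": [
                {
                    "affiliations": ["Pennsylvania State University"],
                    "author_id": "7E15BDFA",
                    "author_name": "roger l geiger",
                }
            ],
        },
    },
]

# Keep only hits whose "authors" is really a list.
with_authors = [h for h in hits if isinstance(h["_source"].get("authors"), list)]

# One row per author; "_id" and the nested title are repeated as metadata.
authors = pd.json_normalize(
    with_authors,
    record_path=["_source", "authors"],
    meta=["_id", ["_source", "title"]],
)
print(authors)
```

Here record_path is the path to the list that should become rows, and meta lists the fields of the enclosing record to carry along; a nested meta field is written as a list of keys and ends up in a column joined with '.', such as _source.title.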