How to insert specific keys from a JSON file into a data frame in Python

Time: 2019-10-12 20:57:47

Tags: python json pandas dataframe

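Apologies if this is trivial or has already been asked. I am new to Python and to working with json files, so I am quite confused.

I have a 9 GB json file scraped from a website. The data contains information on about 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the json file, like so: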

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
  ],
  "industry": "Government",
  "experience": [
    {
      "org": "Republican",
      "end": "Present",
      "start": "January 2017",
      "title": "President of the United States"
    },
    {
      "org": "The Apprentice",
      "end": "2015",
      "start": "2003",
      "title": "The guy that fires people"
    }
  ]
}

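So _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, such as education, awards or interests, or may lack an attribute found in other profiles, such as skills, and so on.

What I want to do is scan every profile in the json file and, if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I want to extract the name of the current employer, i.e. the most recent employer, listed under org. The data frame would look like this: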

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

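...and so on, for all profiles that have these three attributes.

I am struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose the rough pseudocode would be: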

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of 
            data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame

I just need to know how to write this in Python.

1 Answer:

Answer 0: (score: 1)

I assume the file contains all the profiles, like this:

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}
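Since everything below depends on that assumption, here is a minimal sketch of a helper that accepts either of two top-level layouts: a dict keyed by profile name (as assumed above) or a plain list of profiles. The name iter_profiles is hypothetical, and the list layout is only a guess at what the real file might look like.

```python
import json


def iter_profiles(obj):
    """Yield profile dicts from either assumed top-level layout."""
    if isinstance(obj, dict):
        # Layout assumed in this answer: {"profile 1": {...}, "profile 2": {...}}
        yield from obj.values()
    elif isinstance(obj, list):
        # Alternative layout (a guess): [{...}, {...}]
        yield from obj
    else:
        raise TypeError(f"unexpected top-level JSON type: {type(obj).__name__}")


# Tiny stand-ins for the real file contents
as_dict = json.loads('{"profile 1": {"_id": "in-01"}, "profile 2": {"_id": "in-02"}}')
as_list = json.loads('[{"_id": "in-01"}, {"_id": "in-02"}]')

ids_from_dict = [p["_id"] for p in iter_profiles(as_dict)]
ids_from_list = [p["_id"] for p in iter_profiles(as_list)]
```

Note that this does not cover a file holding a single bare profile object; in that case the dict branch would yield the profile's field values rather than profiles.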

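Before continuing, let me show a proper way to use Pandas DataFrames.

An example of a better use of Pandas DataFrames:

Lists make awkward values in a Pandas DataFrame: an object column can hold them, but filtering and grouping on the individual items becomes cumbersome. So we will duplicate the rows instead, as shown in the example below. See this question and JD Long's answer for more details: how to use lists as values in pandas dataframe?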

ID      |    Industry   | Current employer | Skill
___________________________________________________________________
in-01   |    Government | Republican       | Twitter
in-01   |    Government | Republican       | Real Estate
in-01   |    Government | Republican       | Golf
in-02   |    Marketing  | Marketers R Us   | Branding
in-02   |    Marketing  | Marketers R Us   | Social Media
in-02   |    Marketing  | Marketers R Us   | Advertising

See the comments in the code below for explanations:

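import json
import pandas as pd

REQUIRED_KEYS = ("_id", "experience", "industry", "skills")
json_path = "path/to/file.json"  # placeholder: put the path to your .json file here
rows = []  # one entry per (profile, skill) pair

# Load the file as JSON; json.load() parses the whole file into a dict
with open(json_path) as file:
    obj = json.load(file)

# Iterate its items() as key-value pairs.
# This loop depends on my first assumption about the file layout;
# with a different layout it will have to change.
for prof_key, profile in obj.items():
    # Skip profiles that lack any of the required keys
    if not all(key in profile for key in REQUIRED_KEYS):
        continue
    # Current employer: org of the first experience entry whose end is "Present"
    current = [x["org"] for x in profile["experience"]
               if x.get("end") == "Present" and "org" in x]
    if not current:
        continue  # no current employer listed
    for skill in profile["skills"]:
        rows.append([profile["_id"], profile["industry"], current[0], skill])

# Build the DataFrame once, with the columns as in the example
df = pd.DataFrame(rows, columns=['ID', 'Industry', 'Employer', 'Skill'])

The rows are collected in a plain list and the DataFrame is built once at the end. Assigning df.loc[-1] = ... inside the loop would not append a row: it writes to the row labelled -1 every time, so each iteration would overwrite the previous one.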
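One caveat for a 9 GB file: parsing it in one go needs the whole thing in memory. If the file happens to be in JSON Lines form, with one profile object per line (an assumption, I have not seen the file), the profiles can be processed one at a time instead. A minimal sketch using only the standard library; iter_rows is a hypothetical helper name and the sample text stands in for the real file:

```python
import io
import json

REQUIRED_KEYS = ("_id", "experience", "industry", "skills")


def iter_rows(lines):
    """Yield (id, industry, employer, skill) tuples from JSON Lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        profile = json.loads(line)  # parse one profile at a time
        if not all(key in profile for key in REQUIRED_KEYS):
            continue
        # Current employer: org of the first entry whose end is "Present"
        current = [x["org"] for x in profile["experience"]
                   if x.get("end") == "Present" and "org" in x]
        if not current:
            continue
        for skill in profile["skills"]:
            yield (profile["_id"], profile["industry"], current[0], skill)


# Stand-in for the real file: two profiles, the second lacks required keys
sample = io.StringIO(
    '{"_id": "in-01", "industry": "Government", "skills": ["Twitter", "Golf"], '
    '"experience": [{"org": "Republican", "end": "Present"}]}\n'
    '{"_id": "in-02", "industry": "Marketing"}\n'
)
rows = list(iter_rows(sample))
```

With a real file you would pass the open file object to iter_rows instead of the sample, and hand the resulting rows to pd.DataFrame with the same four column names.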

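Later, when you want to use this information, you will have to use df.groupby('ID').

Please let me know if your file format is different, whether this explanation is enough to get you started, or whether you need more information.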
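As a rough illustration of that grouping step, the sketch below collapses the one-row-per-skill table back into one row per profile, with the skills joined into a single string as in the table from the question. The sample rows are copied from the example table above, and the column names are the ones used in this answer.

```python
import pandas as pd

# Long-format sample: one row per (profile, skill) pair
df = pd.DataFrame(
    [
        ["in-01", "Government", "Republican", "Twitter"],
        ["in-01", "Government", "Republican", "Real Estate"],
        ["in-01", "Government", "Republican", "Golf"],
        ["in-02", "Marketing", "Marketers R Us", "Branding"],
        ["in-02", "Marketing", "Marketers R Us", "Social Media"],
        ["in-02", "Marketing", "Marketers R Us", "Advertising"],
    ],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Group by the per-profile columns and join each profile's skills
wide = (
    df.groupby(["ID", "Industry", "Employer"])["Skill"]
      .agg(", ".join)
      .reset_index()
)
```

Grouping on all three per-profile columns, rather than on ID alone, keeps Industry and Employer in the result without a separate merge.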