Question

Apologies if this is trivial or has already been asked. I am new to Python and to working with JSON files, so I am quite confused.

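I have a 9 GB JSON file scraped from a website. The data contains information on about 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the JSON file, like so: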

{
  "_id": "in-00000001",
  "name": {
    "family_name": "Trump",
    "given_name": "Donald"
  },
  "locality": "United States",
  "skills": [
    "Twitter",
    "Real Estate",
    "Golf"
     ],
  "industry": "Government",
  "experience": [
  {
    "org": "Republican",
    "end": "Present",
    "start": "January 2017",
    "title": "President of the United States"
  },
  {
    "org": "The Apprentice",
    "end": "2015",
    "start": "2003",
    "title": "The guy that fires people"
  }]
}

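So here _id, name, locality, skills, industry and experience are the attributes (keys). Another profile may have additional attributes, such as education, awards or interests, or may lack some attribute found in other profiles, such as skills, and so on.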

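What I want to do is scan every profile in the JSON file and, if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need Pandas for this?). From experience, I specifically want the name of the current employer, i.e. the most recent employer listed under org. The data frame would look like this: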

    Industry   | Current employer | Skills
    ___________________________________________________________________
    Government | Republican       | Twitter, Real Estate, Golf
    Marketing  | Marketers R Us   | Branding, Social Media, Advertising

...and so on, for all profiles that have these three attributes.

I am struggling to find a good resource that explains how to do this kind of thing, hence my question.

I suppose the rough pseudocode would be:

for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of 
            data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame

I just need to know how to write this in Python.

Answer 1

I am assuming that the file contains all the profiles in a single object, such as

{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        #Full object as in the example above
    }
}

Before going further, let me show a better way to lay out this data in a Pandas DataFrame.

Example of a better use of Pandas DataFrames:

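Lists are a poor fit as values in a Pandas DataFrame. An object column can technically hold them, but filtering, grouping and most vectorized operations will not work on the individual elements. So we will repeat the row once per skill, as shown in the example below. See this question and JD Long's answer for more details: how to use lists as values in pandas dataframe?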

ID      |    Industry   | Current employer | Skill
___________________________________________________________________
in-01   |    Government | Republican       | Twitter
in-01   |    Government | Republican       | Real Estate
in-01   |    Government | Republican       | Golf
in-02   |    Marketing  | Marketers R Us   | Branding
in-02   |    Marketing  | Marketers R Us   | Social Media
in-02   |    Marketing  | Marketers R Us   | Advertising
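If the data is already in a frame with one list per cell, the same one-row-per-skill layout can be obtained afterwards. The following is a minimal sketch, assuming pandas 0.25 or later (where DataFrame.explode is available); the sample values are taken from the table above and the variable names are only illustrative.

```python
import pandas as pd

# Illustrative sample data in the "wide" shape: one list of skills per profile
wide = pd.DataFrame({
    "ID": ["in-01", "in-02"],
    "Industry": ["Government", "Marketing"],
    "Employer": ["Republican", "Marketers R Us"],
    "Skill": [
        ["Twitter", "Real Estate", "Golf"],
        ["Branding", "Social Media", "Advertising"],
    ],
})

# explode() emits one row per list element and repeats the other columns;
# reset_index() replaces the duplicated index labels with 0..n-1
long = wide.explode("Skill").reset_index(drop=True)
print(long)
```

Each profile now occupies as many rows as it has skills, which is the layout shown in the table.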

The explanations are in the comments of the following code:

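import json
import pandas as pd

# Placeholder: replace with the real path to the .json file
path = "path/to/file.json"

REQUIRED_KEYS = ("_id", "experience", "industry", "skills")

# json.load() parses the open file into a dict.
# Note that this reads the whole file into memory.
with open(path, encoding="utf-8") as file:
    obj = json.load(file)

rows = []
# Iterate over the items as key-value pairs.
# This loop depends on my first assumption about the file layout;
# depending on the actual format, it might have to differ.
for prof_key, profile in obj.items():
    # Verify that the profile contains all the required keys
    if all(key in profile for key in REQUIRED_KEYS):
        # The current employer is the experience entry whose "end" is "Present"
        current = [x["org"] for x in profile["experience"] if x.get("end") == "Present"]
        if not current:
            continue  # no current employer listed, skip this profile
        for skill in profile["skills"]:
            # One row per skill
            rows.append([profile["_id"], profile["industry"], current[0], skill])

# Build the DataFrame once, with the columns as in the example
df = pd.DataFrame(rows, columns=["ID", "Industry", "Employer", "Skill"])

The rows are collected in a plain list and the DataFrame is built once at the end. An assignment such as df.loc[-1] = [...] would not append a row: -1 is treated as an index label, so every assignment would overwrite the same row.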
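Since the loop depends on the file layout, here is a sketch for a different layout, purely as an assumption: one JSON object per line (often called JSON Lines). In that case the 9 GB file never has to be held in memory at once. The function name rows_from_lines and the sample records are hypothetical.

```python
import json

REQUIRED_KEYS = ("_id", "experience", "industry", "skills")

def rows_from_lines(lines):
    """Yield one [id, industry, employer, skill] row per skill.

    `lines` is any iterable of strings, each holding one JSON object
    (this one-object-per-line layout is an assumption about the file).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        profile = json.loads(line)
        if not all(key in profile for key in REQUIRED_KEYS):
            continue  # profile lacks a required attribute
        current = [x.get("org") for x in profile["experience"]
                   if x.get("end") == "Present"]
        if not current:
            continue  # no current employer listed
        for skill in profile["skills"]:
            yield [profile["_id"], profile["industry"], current[0], skill]

# Two sample records: the second lacks the required keys and is skipped
sample = [
    json.dumps({
        "_id": "in-00000001",
        "industry": "Government",
        "skills": ["Twitter", "Real Estate", "Golf"],
        "experience": [
            {"org": "Republican", "end": "Present", "start": "January 2017"},
            {"org": "The Apprentice", "end": "2015", "start": "2003"},
        ],
    }),
    json.dumps({"_id": "in-00000002", "locality": "Canada"}),
]
rows = list(rows_from_lines(sample))
print(rows)
```

With a real file, the open file object can be passed directly, since iterating over it yields one line at a time; the resulting rows can then go to pd.DataFrame with the same four column names.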

Later, when you want to work with this information per profile, you will need to use df.groupby('ID').
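As an illustration of that grouping step, the sketch below collapses the one-row-per-skill layout back into a single comma-separated string per profile. The sample rows mirror the table above, and the name skills_per_id is only illustrative.

```python
import pandas as pd

# Sample rows in the one-row-per-skill layout
df = pd.DataFrame(
    [
        ["in-01", "Government", "Republican", "Twitter"],
        ["in-01", "Government", "Republican", "Real Estate"],
        ["in-01", "Government", "Republican", "Golf"],
        ["in-02", "Marketing", "Marketers R Us", "Branding"],
        ["in-02", "Marketing", "Marketers R Us", "Social Media"],
        ["in-02", "Marketing", "Marketers R Us", "Advertising"],
    ],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Group the rows of each profile and join its skills into one string
skills_per_id = df.groupby("ID")["Skill"].agg(", ".join)
print(skills_per_id)
```

The result is a Series indexed by ID, so skills_per_id["in-01"] gives the skills of the first profile.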

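Please let me know if your file format is different, or whether this explanation is enough to get you started or you need more information.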

How to insert specific keys from a JSON file into a data frame in Python
