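Apologies if this is simple or has already been asked; I'm new to Python and to working with JSON files, so I'm fairly confused.
I scraped a 9 GB JSON file from a website. It contains information on about 3 million individuals. Each individual has attributes, but not all individuals have the same attributes. An attribute corresponds to a key in the JSON file, like so: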
{
    "_id": "in-00000001",
    "name": {
        "family_name": "Trump",
        "given_name": "Donald"
    },
    "locality": "United States",
    "skills": [
        "Twitter",
        "Real Estate",
        "Golf"
    ],
    "industry": "Government",
    "experience": [
        {
            "org": "Republican",
            "end": "Present",
            "start": "January 2017",
            "title": "President of the United States"
        },
        {
            "org": "The Apprentice",
            "end": "2015",
            "start": "2003",
            "title": "The guy that fires people"
        }
    ]
}
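So _id, name, locality, skills, industry and experience are attributes (keys). Another profile may have additional attributes, such as education, awards or interests, or may be missing some attribute found in other profiles, such as skills, and so on.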
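What I'd like to do is scan every profile in the JSON file and, if a profile contains the attributes skills, industry and experience, extract that information and insert it into a data frame (I suppose I need pandas for this?). From experience, I specifically want the name of the current employer, i.e. the most recent org. The data frame would look like this: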
Industry | Current employer | Skills
___________________________________________________________________
Government | Republican | Twitter, Real Estate, Golf
Marketing | Marketers R Us | Branding, Social Media, Advertising
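...and so on, for all profiles that have these three attributes.
I'm struggling to find a good resource that explains how to do this kind of thing, hence my question.
I suppose rough pseudocode would be: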
for each profile in open(path to .json file):
    if profile has keys "experience", "industry" AND "skills":
        on the same row of the data frame:
            insert current employer into "current employer" column of data frame
            insert industry into "industry" column of data frame
            insert list of skills into "skills" column of data frame
I just need to know how to write this in Python.
Answer 0 (score: 1)
I'm assuming the file contains all the profiles in a single object, like this:
{
    "profile 1" : {
        # Full object as in the example above
    },
    "profile 2" : {
        # Full object as in the example above
    }
}
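Before going further, let me show a sound way to lay this out in a pandas DataFrame.
A DataFrame cell can technically hold a list (as a generic Python object), but filtering, grouping and most other operations do not work well on such values. So we will repeat the row once per skill, as shown in the example below. For more details, see this question and JD Long's answer: how to use lists as values in pandas dataframe?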
ID | Industry | Current employer | Skill
___________________________________________________________________
in-01 | Government | Republican | Twitter
in-01 | Government | Republican | Real Estate
in-01 | Government | Republican | Golf
in-02 | Marketing | Marketers R Us | Branding
in-02 | Marketing | Marketers R Us | Social Media
in-02 | Marketing | Marketers R Us | Advertising
See the comments in the following code for explanations:
import json
import pandas as pd

path = "path/to/file.json"  # placeholder: point this at your .json file
required = ("_id", "experience", "industry", "skills")

# Load the whole file into a dict. This relies on my assumption above;
# a different file format needs a different loading step.
with open(path) as file:
    obj = json.load(file)

rows = []
# Iterate over the profiles as key/value pairs.
for prof_key, profile in obj.items():
    # Skip profiles that lack any of the required keys.
    if not all(key in profile for key in required):
        continue
    # The current employer is the experience entry whose "end" is "Present".
    employer = next((x.get("org") for x in profile["experience"]
                     if x.get("end") == "Present"), None)
    if employer is None:
        continue
    # One row per skill.
    for skill in profile["skills"]:
        rows.append([profile["_id"], profile["industry"], employer, skill])

# Build the DataFrame once, with the columns from the example above.
df = pd.DataFrame(rows, columns=["ID", "Industry", "Employer", "Skill"])

The rows are collected in a plain list and the DataFrame is built once at the end. Note that assigning to df.loc[-1] inside the loop would not append a row: it writes to the row labelled -1 and overwrites it on every iteration.
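Picking the current employer is the step most likely to hit messy data, for example a profile where no entry is marked "Present". Below is a minimal sketch of a more defensive helper. The name current_employer is made up for illustration, and the fallback to the first entry assumes that experience is listed most recent first, as in the sample profile; the question does not guarantee this.

```python
def current_employer(experience):
    """Return the org of the current job, or None if it cannot be determined.

    Assumption (not stated in the question): entries are ordered most recent
    first, so the first entry is used when none is marked "Present".
    """
    if not experience:
        return None
    for job in experience:
        if job.get("end") == "Present":
            return job.get("org")
    return experience[0].get("org")


# Sample data taken from the profile in the question.
experience = [
    {"org": "Republican", "end": "Present", "start": "January 2017"},
    {"org": "The Apprentice", "end": "2015", "start": "2003"},
]
print(current_employer(experience))      # Republican
print(current_employer(experience[1:]))  # The Apprentice
print(current_employer([]))              # None
```

Whether falling back to the most recent past employer is acceptable depends on what "current employer" should mean for people with no ongoing job; returning None in that case is an equally reasonable choice.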
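Later, when you want to use this information, you will have to use df.groupby('ID').
Please let me know if your file format is different, whether this explanation is enough to get you started, or if you need more information.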
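As an illustration of that grouping step, here is a small sketch that collapses the one-row-per-skill layout back into the one-row-per-profile table from the question. The rows are hypothetical sample data, and joining the skills into a comma-separated string is just one possible choice.

```python
import pandas as pd

# Hypothetical rows in the long layout (one row per skill).
df = pd.DataFrame(
    [
        ["in-01", "Government", "Republican", "Twitter"],
        ["in-01", "Government", "Republican", "Real Estate"],
        ["in-01", "Government", "Republican", "Golf"],
        ["in-02", "Marketing", "Marketers R Us", "Branding"],
        ["in-02", "Marketing", "Marketers R Us", "Social Media"],
    ],
    columns=["ID", "Industry", "Employer", "Skill"],
)

# Group the rows of each profile and join its skills into a single string.
summary = (
    df.groupby(["ID", "Industry", "Employer"], sort=False)["Skill"]
    .agg(", ".join)
    .reset_index()
)
print(summary)
```

Grouping on all three identifying columns keeps Industry and Employer in the result; grouping on ID alone would work too, but those columns would then need their own aggregation.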
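For instance, if the file turns out to hold one profile per line (the JSON Lines layout, which the pseudocode in the question seems to have in mind), the profiles can be read one at a time instead of loading all 9 GB into memory. This is a minimal sketch under that assumption; iter_rows and the sample lines are made up for illustration, and io.StringIO stands in for the real file.

```python
import io
import json

REQUIRED = ("_id", "experience", "industry", "skills")


def iter_rows(lines):
    """Yield (id, industry, employer, skill) tuples from JSON Lines input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        profile = json.loads(line)
        # Skip profiles that lack any of the required keys.
        if not all(key in profile for key in REQUIRED):
            continue
        employer = next(
            (job.get("org") for job in profile["experience"]
             if job.get("end") == "Present"),
            None,
        )
        if employer is None:
            continue
        for skill in profile["skills"]:
            yield (profile["_id"], profile["industry"], employer, skill)


# Stand-in for open(path): two profiles, the second one lacks "skills".
sample = io.StringIO(
    '{"_id": "in-01", "industry": "Government", "skills": ["Twitter", "Golf"], '
    '"experience": [{"org": "Republican", "end": "Present"}]}\n'
    '{"_id": "in-02", "industry": "Marketing", '
    '"experience": [{"org": "Marketers R Us", "end": "Present"}]}\n'
)
rows = list(iter_rows(sample))
print(rows)
```

With a real file, the same generator can be fed an open file object, and the resulting list of rows passed to pd.DataFrame with the four column names used above. Only the extracted rows are kept in memory, not the full profiles.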