I know this has been done before, but I can't get it to work. I have read in the data from Kaggle (https://www.kaggle.com/rounakbanik/the-movies-dataset). CSV sample:
adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
False,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,tt0113228,en,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92
False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34
release_date genres budget
0 1995-10-30 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... 30000000
1 1995-12-15 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... 65000000
2 1995-12-22 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... 0
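I am trying to normalize the data vertically and tried this: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html
However, my data comes from a CSV file rather than JSON. I want something like the documentation example: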
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {'governor': 'Rick Scott'},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {'governor': 'John Kasich'},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
['info', 'governor']])
result
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
I want to break out the genres column so that the result looks like this:
release_date genres.id genres.name budget
0 1995-10-30 16 Animation 30000000
1 1995-10-30 35 Genre 30000000
2 1995-12-15 12 Adventure 65000000
3 1995-12-15 14 Genre 65000000
.
.
100 1995-12-22 10749 Romance 0
101 1995-12-22 35 Genre 0
I tried using
for data in test:
    data_row = data['genres']
    time = data['release_date']
    for row in data_row:
        row['Time'] = time
        rows.append(row)
and
result = pd.json_normalize(test.to_dict(), 'genres', ['budget'])
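But I am unsuccessful because my file is not JSON, and I get the error AttributeError: 'str' object has no attribute 'values'. I am also not sure whether I am searching with the right keywords.
Answer 0 (score: 2)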
Here is one way to do it using flatten_json and apply(lambda x:). For some reason, json_normalize did not work for me.
I downloaded the file, read it in, and used the first 10 rows for this.
import pandas as pd
from flatten_json import flatten

df = pd.read_csv('movies_metadata.csv', low_memory=False)
dft = df[0:10]  # first 10 rows only

df_list = []

def flattenjson(x):
    # genres is stored as a string, so eval turns it back into a list of dicts
    dfa = pd.DataFrame((flatten(d, '.') for d in eval(x['genres'])))
    # copy the parent row's fields onto every genre row
    dfa[['release_date', 'original_title', 'budget']] = x[['release_date', 'original_title', 'budget']]
    df_list.append(dfa)

dft.apply(lambda x: flattenjson(x), axis=1)
pd.concat(df_list)
id name release_date original_title budget
0 16 Animation 1995-10-30 Toy Story 30000000
1 35 Comedy 1995-10-30 Toy Story 30000000
2 10751 Family 1995-10-30 Toy Story 30000000
0 12 Adventure 1995-12-15 Jumanji 65000000
1 14 Fantasy 1995-12-15 Jumanji 65000000
2 10751 Family 1995-12-15 Jumanji 65000000
0 10749 Romance 1995-12-22 Grumpier Old Men 0
1 35 Comedy 1995-12-22 Grumpier Old Men 0
0 35 Comedy 1995-12-22 Waiting to Exhale 16000000
1 18 Drama 1995-12-22 Waiting to Exhale 16000000
2 10749 Romance 1995-12-22 Waiting to Exhale 16000000
0 35 Comedy 1995-02-10 Father of the Bride Part II 0
0 28 Action 1995-12-15 Heat 60000000
1 80 Crime 1995-12-15 Heat 60000000
2 18 Drama 1995-12-15 Heat 60000000
3 53 Thriller 1995-12-15 Heat 60000000
0 35 Comedy 1995-12-15 Sabrina 58000000
1 10749 Romance 1995-12-15 Sabrina 58000000
0 28 Action 1995-12-22 Tom and Huck 0
1 12 Adventure 1995-12-22 Tom and Huck 0
2 18 Drama 1995-12-22 Tom and Huck 0
3 10751 Family 1995-12-22 Tom and Huck 0
0 28 Action 1995-12-22 Sudden Death 35000000
1 12 Adventure 1995-12-22 Sudden Death 35000000
2 53 Thriller 1995-12-22 Sudden Death 35000000
0 12 Adventure 1995-11-16 GoldenEye 58000000
1 28 Action 1995-11-16 GoldenEye 58000000
2 53 Thriller 1995-11-16 GoldenEye 58000000
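For comparison, json_normalize can also produce this shape once the genres strings are parsed and the rows are passed as a list of records. The following is only a minimal sketch: the inline CSV text is a made-up stand-in for movies_metadata.csv with just three columns, and ast.literal_eval is swapped in for eval as the parser. A bare to_dict() returns a column-oriented dict, which is likely why the attempt in the question failed.

```python
import ast
import io

import pandas as pd

# Tiny inline stand-in for movies_metadata.csv (assumption: only the three
# columns used here, with genres stored as a quoted string).
csv_text = '''release_date,genres,budget
1995-10-30,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",30000000
1995-12-15,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}]",65000000
'''

df = pd.read_csv(io.StringIO(csv_text))

# The genres cells are strings, so parse them into real lists of dicts first.
df['genres'] = df['genres'].apply(ast.literal_eval)

# json_normalize wants a list of records (one dict per row).
records = df.to_dict(orient='records')

result = pd.json_normalize(
    records,
    record_path='genres',              # the nested list to expand into rows
    meta=['release_date', 'budget'],   # parent fields repeated on each row
    record_prefix='genres.',           # yields genres.id and genres.name
)
print(result)
```

The record_prefix argument is what gives the genres.id and genres.name column names asked for in the question.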
Answer 1 (score: 1)
Another approach, using pandas.DataFrame.apply and pandas.DataFrame.explode:
df = df.loc[:,['release_date','genres','budget']]
df['genres'] = df.genres.apply(eval)
df = df.explode('genres').dropna()
df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
df.drop('genres', axis=1, inplace=True)
Output:
>>> df
release_date budget genres.id genres.name
0 1995-10-30 30000000 16 Animation
0 1995-10-30 30000000 35 Comedy
0 1995-10-30 30000000 10751 Family
1 1995-12-15 65000000 12 Adventure
1 1995-12-15 65000000 14 Fantasy
1 1995-12-15 65000000 10751 Family
2 1995-12-22 0 10749 Romance
2 1995-12-22 0 35 Comedy
3 1995-12-22 16000000 35 Comedy
3 1995-12-22 16000000 18 Drama
3 1995-12-22 16000000 10749 Romance
Because the lists of dictionaries in the genres column are actually string values, applying eval converts them back into lists of dictionaries.
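If running eval on file contents is a concern, ast.literal_eval from the standard library may be a safer substitute, since it only accepts Python literals. This is a small sketch of that swap, not part of the approach above. The cell string is a hand-copied sample from the genres column.

```python
import ast

# A genres cell as it arrives from read_csv: a plain string, not a list.
cell = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = ast.literal_eval(cell)  # parses literals only, runs no code
print(type(parsed).__name__, parsed[0]['name'])

# Anything that is not a literal (for example a function call) is rejected.
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError as exc:
    print('rejected:', type(exc).__name__)
```

In the pipeline this would read df['genres'] = df.genres.apply(ast.literal_eval).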
Use the DataFrame.explode method to turn each dictionary in a list into its own row (so each row holds exactly one dictionary):
>>> df['genres'] = df.genres.apply(eval)
>>> df = df.explode('genres').dropna()
>>> df
release_date genres budget
0 1995-10-30 {'id': 16, 'name': 'Animation'} 30000000
0 1995-10-30 {'id': 35, 'name': 'Comedy'} 30000000
0 1995-10-30 {'id': 10751, 'name': 'Family'} 30000000
1 1995-12-15 {'id': 12, 'name': 'Adventure'} 65000000
1 1995-12-15 {'id': 14, 'name': 'Fantasy'} 65000000
1 1995-12-15 {'id': 10751, 'name': 'Family'} 65000000
2 1995-12-22 {'id': 10749, 'name': 'Romance'} 0
2 1995-12-22 {'id': 35, 'name': 'Comedy'} 0
3 1995-12-22 {'id': 35, 'name': 'Comedy'} 16000000
3 1995-12-22 {'id': 18, 'name': 'Drama'} 16000000
3 1995-12-22 {'id': 10749, 'name': 'Romance'} 16000000
Convert the dictionary keys into separate columns by applying pandas.Series to it, then drop the old genres column:
>>> df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
>>> df.drop('genres', axis=1, inplace=True)
>>> df
release_date budget genres.id genres.name
0 1995-10-30 30000000 16 Animation
0 1995-10-30 30000000 35 Comedy
0 1995-10-30 30000000 10751 Family
1 1995-12-15 65000000 12 Adventure
1 1995-12-15 65000000 14 Fantasy
1 1995-12-15 65000000 10751 Family
2 1995-12-22 0 10749 Romance
2 1995-12-22 0 35 Comedy
3 1995-12-22 16000000 35 Comedy
3 1995-12-22 16000000 18 Drama
3 1995-12-22 16000000 10749 Romance
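As a variation on the last step, the genre columns can be built in one go from the list of dicts instead of calling pd.Series on every row. This is a hedged sketch on a small hand-made frame rather than the real file. The names genre_cols and flat are made up for illustration, and ignore_index=True assumes pandas 1.1 or later.

```python
import ast

import pandas as pd

# Hand-made sample with the same three columns as above.
df = pd.DataFrame({
    'release_date': ['1995-10-30', '1995-12-22'],
    'genres': [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 10749, 'name': 'Romance'}]",
    ],
    'budget': [30000000, 0],
})

df['genres'] = df['genres'].apply(ast.literal_eval)

# ignore_index=True gives every exploded row a unique label, so the
# index-based join below cannot duplicate rows.
df = df.explode('genres', ignore_index=True).dropna(subset=['genres'])

# Build both genre columns at once from the list of dicts.
genre_cols = pd.DataFrame(df['genres'].tolist(), index=df.index).add_prefix('genres.')
flat = df.drop(columns='genres').join(genre_cols)
print(flat)
```

Unlike the output above, the rows here are relabeled 0, 1, 2 rather than keeping the original movie's row label.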