展平具有嵌套字典的数据框中的列

时间:2021-06-16 03:01:38

标签: python json pandas dictionary

我知道这之前已经完成,但我无法完成 - 我已经从 kaffle 中读取了数据 - https://www.kaggle.com/rounakbanik/the-movies-dataset csv样本

adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,poster_path,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
False,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,tt0113497,en,Jumanji,"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures.",17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
False,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet.jpg', 'backdrop_path': '/hypTnLot2z8wpFS7qwsQHW1uV8u.jpg'}",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,tt0113228,en,Grumpier Old Men,"A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John and Max. Meanwhile, a sultry Italian divorcée opens a restaurant at the local bait shop, alarming the locals who worry she'll scare the fish away. But she's less interested in seafood than she is in cooking up a hot time with Max.",11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92
False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusive ""good man"" to break a string of less-than-stellar lovers. Friends and confidants Vannah, Bernie, Glo and Robin talk it all out, determined to find a better way to breathe.",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]","[{'iso_3166_1': 'US', 'name': 'United States of America'}]",1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34

  release_date                                             genres    budget
0   1995-10-30  [{'id': 16, 'name': 'Animation'}, {'id': 35, '...  30000000
1   1995-12-15  [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...  65000000
2   1995-12-22  [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...         0

' 我正在尝试垂直规范化数据并尝试过此https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html

但是我的数据是从 csv 中引入的,而不是 json 格式。想要:

data = [{'state': 'Florida',
         'shortname': 'FL',
         'info': {'governor': 'Rick Scott'},
         'counties': [{'name': 'Dade', 'population': 12345},
                      {'name': 'Broward', 'population': 40000},
                      {'name': 'Palm Beach', 'population': 60000}]},
        {'state': 'Ohio',
         'shortname': 'OH',
         'info': {'governor': 'John Kasich'},
         'counties': [{'name': 'Summit', 'population': 1234},
                      {'name': 'Cuyahoga', 'population': 1337}]}]
result = pd.json_normalize(data, 'counties', ['state', 'shortname',
                                           ['info', 'governor']])
result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich

我想打破这种类型,所以它看起来像这样:

  release_date  genres.id          genres.name                     budget
0   1995-10-30  16                 Animation                      30000000
1   1995-10-30  35.                Genre                          30000000
2   1995-12-15  12                 Adventure                      65000000
3   1995-12-15  14                 Genre                          65000000

.
.
100 1995-12-22  10749              Romance                        0
101 1995-12-22  35                 Genre                          0

我试过使用

for data in test:
    data_row = data['genres']
    time = data['release_date']
      
    for row in data_row:
        row['Time'] = time
        rows.append(row)

result = pd.json_normalize(test.to_dict(), 'genres', ['budget'])

但是我不成功,因为我的文件不是 json 并且得到错误 AttributeError: 'str' object has no attribute 'values' 我也不确定我是否研究了正确的关键字

2 个答案:

答案 0 :(得分:2)

这是使用 flatten_jsonapply(lambda x:) 的一种方法。出于某种原因,json_normalize 对我不起作用。

我下载了文件,读入并为此使用了前 10 行。

from flatten_json import flatten
df = pd.read_csv('movies_metadata.csv', low_memory=False)
dft = df[0:10]

def flattenjson(x):
    dfa = pd.DataFrame((flatten(d, '.') for d in eval(x['genres'])))
    dfa[['release_date', 'original_title', 'budget']] = x[['release_date', 'original_title', 'budget']]
    df_list.append(dfa)

df_list = []
dft.apply(lambda x: flattenjson(x), axis=1)
pd.concat(df_list)

      id       name release_date               original_title    budget
0     16  Animation   1995-10-30                    Toy Story  30000000
1     35     Comedy   1995-10-30                    Toy Story  30000000
2  10751     Family   1995-10-30                    Toy Story  30000000
0     12  Adventure   1995-12-15                      Jumanji  65000000
1     14    Fantasy   1995-12-15                      Jumanji  65000000
2  10751     Family   1995-12-15                      Jumanji  65000000
0  10749    Romance   1995-12-22             Grumpier Old Men         0
1     35     Comedy   1995-12-22             Grumpier Old Men         0
0     35     Comedy   1995-12-22            Waiting to Exhale  16000000
1     18      Drama   1995-12-22            Waiting to Exhale  16000000
2  10749    Romance   1995-12-22            Waiting to Exhale  16000000
0     35     Comedy   1995-02-10  Father of the Bride Part II         0
0     28     Action   1995-12-15                         Heat  60000000
1     80      Crime   1995-12-15                         Heat  60000000
2     18      Drama   1995-12-15                         Heat  60000000
3     53   Thriller   1995-12-15                         Heat  60000000
0     35     Comedy   1995-12-15                      Sabrina  58000000
1  10749    Romance   1995-12-15                      Sabrina  58000000
0     28     Action   1995-12-22                 Tom and Huck         0
1     12  Adventure   1995-12-22                 Tom and Huck         0
2     18      Drama   1995-12-22                 Tom and Huck         0
3  10751     Family   1995-12-22                 Tom and Huck         0
0     28     Action   1995-12-22                 Sudden Death  35000000
1     12  Adventure   1995-12-22                 Sudden Death  35000000
2     53   Thriller   1995-12-22                 Sudden Death  35000000
0     12  Adventure   1995-11-16                    GoldenEye  58000000
1     28     Action   1995-11-16                    GoldenEye  58000000
2     53   Thriller   1995-11-16                    GoldenEye  58000000

答案 1 :(得分:1)

另一种使用 pandas.DataFrame.applypandas.DataFrame.explode 的方法:

df = df.loc[:,['release_date','genres','budget']]
df['genres'] = df.genres.apply(eval)
df = df.explode('genres').dropna()
df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
df.drop('genres', axis=1, inplace=True)

输出:

>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance

步骤:

由于 genres 列中的字典列表实际上是字符串值,因此应用 eval 将它们转换回字典列表。

使用 pd.explode 方法将每个列表中的字典转换为单独的行值(因此每行只有一个字典):

>>> df['genres'] = df.genres.apply(eval)
>>> df = df.explode('genres').dropna()
>>> df
  release_date                            genres    budget
0   1995-10-30   {'id': 16, 'name': 'Animation'}  30000000
0   1995-10-30      {'id': 35, 'name': 'Comedy'}  30000000
0   1995-10-30   {'id': 10751, 'name': 'Family'}  30000000
1   1995-12-15   {'id': 12, 'name': 'Adventure'}  65000000
1   1995-12-15     {'id': 14, 'name': 'Fantasy'}  65000000
1   1995-12-15   {'id': 10751, 'name': 'Family'}  65000000
2   1995-12-22  {'id': 10749, 'name': 'Romance'}         0
2   1995-12-22      {'id': 35, 'name': 'Comedy'}         0
3   1995-12-22      {'id': 35, 'name': 'Comedy'}  16000000
3   1995-12-22       {'id': 18, 'name': 'Drama'}  16000000
3   1995-12-22  {'id': 10749, 'name': 'Romance'}  16000000

通过对其应用 pandas.Series 将字典键值转换为单独的列。删除旧的流派列:

>>> df[['genres.id','genres.name']] = df.genres.apply(pd.Series)
>>> df.drop('genres', axis=1, inplace=True)
>>> df
  release_date    budget  genres.id genres.name
0   1995-10-30  30000000         16   Animation
0   1995-10-30  30000000         35      Comedy
0   1995-10-30  30000000      10751      Family
1   1995-12-15  65000000         12   Adventure
1   1995-12-15  65000000         14     Fantasy
1   1995-12-15  65000000      10751      Family
2   1995-12-22         0      10749     Romance
2   1995-12-22         0         35      Comedy
3   1995-12-22  16000000         35      Comedy
3   1995-12-22  16000000         18       Drama
3   1995-12-22  16000000      10749     Romance