In a DataFrame

Time: 2018-01-09 23:03:05

Tags: python json pandas dataframe nested

A snippet from the TMDB csv file:

movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}, {""cast_id"": 5, ""character"": ""Trudy Chacon"", ""credit_id"": ""52fe48009251416c750ac9d3"", ""gender"": 1, ""id"": 17647, ""name"": ""Michelle Rodriguez"", ""order"": 4}, {""cast_id"": 8, ""character"": ""Selfridge"", ""credit_id"": ""52fe48009251416c750ac9e1"", ""gender"": 2, ""id"": 1771, ""name"": ""Giovanni Ribisi"", ""order"": 5}

Code:

import pandas as pd
tmdb_credit_df = pd.read_csv('tmdb.csv')
# turn each cell's string into a Python list of dicts
tmdb_credit_df['cast'] = tmdb_credit_df['cast'].apply(eval)

Each cell in the cast column contains a list of dicts. For example:

[{'cast_id': 242,
  'character': 'Jake Sully',
  'credit_id': '5602a8a7c3a3685532001c9a',
  'gender': 2,
  'id': 65731,
  'name': 'Sam Worthington',
  'order': 0},
 {'cast_id': 3,
  'character': 'Neytiri',
  'credit_id': '52fe48009251416c750ac9cb',
  'gender': 1,
  'id': 8691,
  'name': 'Zoe Saldana',
  'order': 1}, ...]

I am trying to flatten the dataframe so that it looks like:

    movie_id    title   cast_id    character    ...
0   19995      Avatar   242        Jake Sully   ...
1   19995      Avatar   3          Neytiri      ...

Is there a way to flatten/unpack this table using json_normalize() or .apply(), rather than iterating over each row?

I tried json_normalize(tmdb_credit_df.cast), but got the error:

'list' object has no attribute 'values'

I also tried tmdb_credit_df.cast.apply(lambda x: x[0]) to extract one field at a time, but I got the following error:

list index out of range

2 answers:

Answer 0: (score: 2)

Starting with -

df

   movie_id   title                                               cast
0     19995  Avatar  [{"cast_id": 242, "character": "Jake Sully", "...

Here, cast is a column of strings.

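  1. First, convert the cast column to a column of Python objects using json.loads
  2. Next, convert df to a list of dictionaries using to_dict
  3. Finally, call json_normalize with the appropriate arguments

Use apply(json.loads) + to_dict -

    import json

    # first two steps: parse each JSON string, then turn the rows into a list of dicts
    # ('records' is the full orient name; the short form 'r' was removed in pandas 2.0)
    d = df.assign(cast=df.cast.apply(json.loads)).to_dict('records')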
    

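Next, call json_normalize with the meta and record_path arguments -

    # pd.json_normalize is the top-level name in pandas >= 1.0 (older releases: pd.io.json.json_normalize)
    df = pd.json_normalize(d, meta=['movie_id', 'title'], record_path=['cast'])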
    df
    
       cast_id            character                 credit_id  gender     id  \
    0      242           Jake Sully  5602a8a7c3a3685532001c9a       2  65731   
    1        3              Neytiri  52fe48009251416c750ac9cb       1   8691   
    2       25  Dr. Grace Augustine  52fe48009251416c750aca39       1  10205   
    3        4        Col. Quaritch  52fe48009251416c750ac9cf       2  32747   
    4        5         Trudy Chacon  52fe48009251416c750ac9d3       1  17647   
    5        8            Selfridge  52fe48009251416c750ac9e1       2   1771   
    
                     name  order   title  movie_id  
    0     Sam Worthington      0  Avatar     19995  
    1         Zoe Saldana      1  Avatar     19995  
    2    Sigourney Weaver      2  Avatar     19995  
    3        Stephen Lang      3  Avatar     19995  
    4  Michelle Rodriguez      4  Avatar     19995  
    5     Giovanni Ribisi      5  Avatar     19995 
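
The three steps can also be put together as a self-contained sketch. The two-row sample frame below is made up for illustration (the second movie and its empty cast are hypothetical), and the sketch assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize.

```python
import json
import pandas as pd

# Hypothetical sample in the same shape as the TMDB snippet:
# one movie with two cast members, one movie with an empty cast list.
df = pd.DataFrame({
    "movie_id": [19995, 1],
    "title": ["Avatar", "Untitled"],
    "cast": [
        '[{"cast_id": 242, "character": "Jake Sully", "order": 0},'
        ' {"cast_id": 3, "character": "Neytiri", "order": 1}]',
        "[]",
    ],
})

# Steps 1 and 2: parse the JSON strings, then convert the rows to a list of dicts.
records = df.assign(cast=df["cast"].apply(json.loads)).to_dict("records")

# Step 3: one output row per cast entry, with movie_id and title repeated as metadata.
flat = pd.json_normalize(records, record_path="cast", meta=["movie_id", "title"])
print(flat)
```

A movie whose cast list is empty simply contributes no rows here, so it should not trigger the "list index out of range" error that indexing with x[0] produces on such rows.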
    

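Answer 1: (score: 0)

This loops over the csv, converts the json (cast_str) into a Python list of dicts (cast), and passes it to the pandas DataFrame constructor (cast_df). Each resulting DataFrame is appended to a list (frames), and finally the contents of frames are concatenated into one large DataFrame (df).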

import csv
import json
import pandas as pd

path = "/path/to/file/tmdb.csv"
frames = list()

reader = csv.reader(open(path))
next(reader)  # skip the header row in csv

for movie_id_str, title, cast_str, _ in reader:

    cast = json.loads(cast_str)            
    cast_df = pd.DataFrame(cast)            
    cast_df['movie_id'] = int(movie_id_str)  # use the loop variable; movie_id was undefined
    cast_df['title'] = title
    frames.append(cast_df)

df = pd.concat(frames, ignore_index=True)

Note that I did need to alter the test data by appending ]", "[]" to the end so that it parses correctly.
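
To try the loop without the file on disk, here is a minimal sketch that reads from an in-memory CSV instead. The sample text is a shortened, hypothetical stand-in for tmdb.csv (two cast members, an empty crew column), and csv.DictReader is swapped in for the positional unpacking so the header row does not need to be skipped by hand.

```python
import csv
import io
import json
import pandas as pd

# Hypothetical in-memory stand-in for tmdb.csv; quotes inside the cast
# field are doubled, as in the original file.
sample = io.StringIO(
    'movie_id,title,cast,crew\n'
    '19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully""}, '
    '{""cast_id"": 3, ""character"": ""Neytiri""}]","[]"\n'
)

frames = []
for row in csv.DictReader(sample):
    cast_df = pd.DataFrame(json.loads(row["cast"]))  # one row per cast member
    cast_df["movie_id"] = int(row["movie_id"])       # repeat the movie fields
    cast_df["title"] = row["title"]
    frames.append(cast_df)

df = pd.concat(frames, ignore_index=True)
print(df)
```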