来自TMDB csv文件的片段:
movie_id,title,cast,crew
19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""credit_id"": ""5602a8a7c3a3685532001c9a"", ""gender"": 2, ""id"": 65731, ""name"": ""Sam Worthington"", ""order"": 0}, {""cast_id"": 3, ""character"": ""Neytiri"", ""credit_id"": ""52fe48009251416c750ac9cb"", ""gender"": 1, ""id"": 8691, ""name"": ""Zoe Saldana"", ""order"": 1}, {""cast_id"": 25, ""character"": ""Dr. Grace Augustine"", ""credit_id"": ""52fe48009251416c750aca39"", ""gender"": 1, ""id"": 10205, ""name"": ""Sigourney Weaver"", ""order"": 2}, {""cast_id"": 4, ""character"": ""Col. Quaritch"", ""credit_id"": ""52fe48009251416c750ac9cf"", ""gender"": 2, ""id"": 32747, ""name"": ""Stephen Lang"", ""order"": 3}, {""cast_id"": 5, ""character"": ""Trudy Chacon"", ""credit_id"": ""52fe48009251416c750ac9d3"", ""gender"": 1, ""id"": 17647, ""name"": ""Michelle Rodriguez"", ""order"": 4}, {""cast_id"": 8, ""character"": ""Selfridge"", ""credit_id"": ""52fe48009251416c750ac9e1"", ""gender"": 2, ""id"": 1771, ""name"": ""Giovanni Ribisi"", ""order"": 5}
代码:
tmdb_credit_df = pd.read_csv('tmdb.csv')
tmdb_credit_df['cast'] = tmdb_credit_df['cast'].apply(eval)
cast
列中的每个单元格都包含一个dicts列表。例如:
[{'cast_id': 242,
'character': 'Jake Sully',
'credit_id': '5602a8a7c3a3685532001c9a',
'gender': 2,
'id': 65731,
'name': 'Sam Worthington',
'order': 0},
{'cast_id': 3,
'character': 'Neytiri',
'credit_id': '52fe48009251416c750ac9cb',
'gender': 1,
'id': 8691,
'name': 'Zoe Saldana',
'order': 1}, ...]
我试图压缩数据框,使其看起来像:
movie_id title cast_id character ...
0 19995 Avatar 242 Jake Sully ...
1 19995 Avatar 3 Neytiri ...
有没有办法使用json_normalize()
或.apply()
来展平/解包这个表,而不是遍历每一行?
我尝试使用json_normalize(tmdb_credit_df.cast)
,但收到错误:
'list' object has no attribute 'values'
我还尝试tmdb_credit_df.cast.apply(lambda x: x[0])
一次提取一个字段,但我收到以下错误:
list index out of range
答案 0 :(得分:2)
从 -
开始df
movie_id title cast
0 19995 Avatar [{"cast_id": 242, "character": "Jake Sully", "...
此处,cast
是字符串的列。
cast
json.loads
列转换为python对象列
df
to_dict
转换为字典
json_normalize
-
使用apply(pd.io.json.loads)
+ to_dict
-
d = df.assign(cast=df.cast.apply(pd.io.json.loads)).to_dict('r')
接下来,使用json_normalize
和meta
参数调用record_path
-
df = pd.io.json.json_normalize(d, meta=['movie_id', 'title'], record_path=['cast'])
df
cast_id character credit_id gender id \
0 242 Jake Sully 5602a8a7c3a3685532001c9a 2 65731
1 3 Neytiri 52fe48009251416c750ac9cb 1 8691
2 25 Dr. Grace Augustine 52fe48009251416c750aca39 1 10205
3 4 Col. Quaritch 52fe48009251416c750ac9cf 2 32747
4 5 Trudy Chacon 52fe48009251416c750ac9d3 1 17647
5 8 Selfridge 52fe48009251416c750ac9e1 2 1771
name order title movie_id
0 Sam Worthington 0 Avatar 19995
1 Zoe Saldana 1 Avatar 19995
2 Sigourney Weaver 2 Avatar 19995
3 Stephen Lang 3 Avatar 19995
4 Michelle Rodriguez 4 Avatar 19995
5 Giovanni Ribisi 5 Avatar 19995
答案 1 :(得分:0)
这循环遍历csv并将json(cast_str
)转换为Python的dicts列表(cast
),并将其传递给Pandas' DataFrame构造函数(cast_df
)会将生成的DataFrame附加到列表(frames
),然后最终将frames
的内容合并到一个大的DataFrame(df
)中。 / p>
import csv
import json
import pandas as pd
path = "/path/to/file/tmdb.csv"
frames = list()
reader = csv.reader(open(path))
next(reader) # skip the header row in csv
for movie_id_str, title, cast_str, _ in reader:
cast = json.loads(cast_str)
cast_df = pd.DataFrame(cast)
cast_df['movie_id'] = int(movie_id)
cast_df['title'] = title
frames.append(cast_df)
df = pd.concat(frames, ignore_index=True)
注意,我确实需要通过在末尾添加]", "[]"
来更改测试数据,以便正确解析。