Question

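I am building a model to predict revenue, and I am trying to parse the 'cast' column of my data frame, because its values are not lists or dictionaries.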

x['cast']

The output is

0    [{'cast_id': 4, 'character': 'Lou', 'credit_id...
1    [{'cast_id': 1, 'character': 'Mia Thermopolis'...
2    [{'cast_id': 5, 'character': 'Andrew Neimann',...
3    [{'cast_id': 1, 'character': 'Vidya Bagchi', '...
4    [{'cast_id': 3, 'character': 'Chun-soo', 'cred...
5    [{'cast_id': 6, 'character': 'Pinocchio (voice...
6    [{'cast_id': 23, 'character': 'Clyde', 'credit...
7    [{'cast_id': 2, 'character': 'Himself', 'credi...
8    [{'cast_id': 1, 'character': 'Long John Silver...
9    [{'cast_id': 24, 'character': 'Jonathan Steinb...
Name: cast, dtype: object

I need all the 'character' values in each list. But when I try

x['cast'][0]['character']

it throws this error

TypeError: string indices must be integers

Please help.

Answer 1

First convert the JSON-like strings to lists of dictionaries, then get the value by dict key from the first element of each list:

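import ast

# Only parse rows that actually hold a value; NaN rows are left untouched.
mask = x['cast'].notna()

# Turn each string into a real list of dicts.
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(ast.literal_eval)
# Alternative, only if the strings are valid JSON (double-quoted keys and values):
# x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(json.loads)  # needs `import json`
# Take 'character' from the first dict of each list; fall back if the list is empty or the key is missing.
x.loc[mask, 'cast'] = x.loc[mask, 'cast'].apply(lambda lst: lst[0].get('character', 'not match data') if lst else 'not match data')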
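The snippet above keeps only the first 'character' of each cast list, while the question asks for all of them. A minimal sketch of one way to collect every value, assuming each non-null cell is the string form of a Python list of dicts; the sample data, the helper name all_characters, and the new column name characters are made up for illustration:

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample: cells hold the string form of a list of dicts, as read from a CSV.
x = pd.DataFrame({
    'cast': [
        "[{'cast_id': 4, 'character': 'Lou'}, {'cast_id': 5, 'character': 'Nick'}]",
        "[]",
        np.nan,
    ]
})


def all_characters(cell):
    """Return every 'character' value in one cell, or [] for missing or empty cells."""
    if not isinstance(cell, str):
        return []  # NaN or other non-string values
    entries = ast.literal_eval(cell)  # string -> list of dicts
    return [entry['character'] for entry in entries if 'character' in entry]


# Write to a new column so the original 'cast' strings stay available.
x['characters'] = x['cast'].apply(all_characters)
print(x['characters'].tolist())
```

Returning an empty list for missing cells keeps the column type uniform, which may be convenient if the lists are later exploded or counted as features.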

Edit:

If that still fails, use Series.str.extract:

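import numpy as np
import pandas as pd

x = pd.DataFrame({'cast':[[{'cast_id': 4, 'character': 'Lou'}], np.nan]})

# Cast each cell to its string form, then capture the text between the quotes after 'character'.
x['cat'] = x['cast'].astype(str).str.extract(r"'character': '([^']+)'", expand=False)
print(x)
                                   cast  cat
0  [{'cast_id': 4, 'character': 'Lou'}]  Lou
1                                   NaN  NaN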
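Series.str.extract returns only the first match in each row. If every 'character' value is needed, Series.str.findall may be a closer fit. The following is a sketch under the assumption that the cells are single-quoted strings like the ones in the question; the sample data and the column name characters are illustrative. The pattern will miss names that contain an apostrophe, since those are typically written with double quotes in the string form.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one row with two cast entries, one missing row.
x = pd.DataFrame({
    'cast': [
        "[{'cast_id': 4, 'character': 'Lou'}, {'cast_id': 5, 'character': 'Nick'}]",
        np.nan,
    ]
})

# findall returns a list with every captured group per row; missing rows stay NaN.
x['characters'] = x['cast'].str.findall(r"'character': '([^']+)'")
print(x['characters'].tolist())
```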
