I am trying to split a pandas DataFrame column into multiple rows.
DATA: The input DataFrame looks like this:
sports_name,player_name,player_country,player_average
football,XYZ,US,"[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]"
football,ABC,US,"[['1', '56.61'], ['2', '52.63'],['3', 'NA'],['4', '44.32'],['5', '39.69']]"
cricket,PQR,IND,"[['1', '98.73'], ['2', '72.62'],['3', '71.53'],['4', '73.72']]"
cricket,LMN,IND,"[['1', '72.52'], ['2', '71.82'],['3', '-'],['4', '62.72'],['5', '73.83']]"
Data info:
Requirements:
Output: The output DataFrame should look like this:
sports_name,player_name,player_country,player_match,player_average
football,XYZ,US,1,62.58
football,XYZ,US,3,88.35
football,XYZ,US,4,59.39
football,ABC,US,1,56.61
football,ABC,US,2,52.63
cricket,PQR,IND,1,98.73
cricket,PQR,IND,2,72.62
cricket,PQR,IND,3,71.53
cricket,PQR,IND,4,73.72
cricket,LMN,IND,1,72.52
cricket,LMN,IND,2,71.82
cricket,LMN,IND,4,62.72
cricket,LMN,IND,5,73.83
Edit:
Note that the data set is very large. It may contain ~2,000 arrays in "player_average" and ~10,00,000 (one million) rows.
Answer 0: (score: 1)
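Suppose you start with:
import ast
import pandas as pd

# Parse each string into a list of [match, average] pairs,
# then spread the pairs across integer-named columns.
as_lists = pd.concat(
    [df, pd.DataFrame(df.player_average.apply(ast.literal_eval).tolist())],
    axis=1).drop('player_average', axis=1)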
>>> as_lists
   sports_name player_name player_country           0           1           2           3           4
0     football         XYZ             US  [1, 62.58]  [2, 25.34]  [3, 88.35]  [4, 59.39]        None
1     football         ABC             US  [1, 56.61]  [2, 52.63]     [3, NA]  [4, 44.32]  [5, 39.69]
2      cricket         PQR            IND  [1, 98.73]  [2, 72.62]  [3, 71.53]  [4, 73.72]        None
3      cricket         LMN            IND  [1, 72.52]  [2, 71.82]      [3, -]  [4, 62.72]  [5, 73.83]
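To see what `ast.literal_eval` does to a single cell in the step above, here is a minimal sketch. The names `cell` and `parsed` are illustrative only and do not come from the answer. Note that the inner values remain strings after parsing, which is why 'NA' and '-' survive into the later steps.

```python
import ast

# One raw cell from the player_average column, as it appears in the CSV (a string).
cell = "[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]"

# literal_eval only evaluates Python literals, so it is safer than eval().
parsed = ast.literal_eval(cell)

print(len(parsed))   # number of [match, average] pairs in this cell
print(parsed[0])     # first pair; both elements are still strings
```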
Now melt it, choosing the id and value columns by whether the column name is an integer:
melted = as_lists.melt(
    id_vars=[c for c in as_lists.columns if not isinstance(c, int)],
    value_vars=[c for c in as_lists.columns if isinstance(c, int)]).dropna()
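The melt step may be easier to follow on a tiny frame. The sketch below uses a hypothetical two-row frame named `wide` (not from the answer) with one string-named column and two integer-named columns. After melting, the integer column names end up in a `variable` column and the pairs in a `value` column, and `dropna()` discards the padding `None` entries.

```python
import pandas as pd

# Hypothetical miniature of as_lists: one id column, two integer-named columns.
wide = pd.DataFrame({
    'player_name': ['XYZ', 'ABC'],
    0: [['1', '62.58'], ['1', '56.61']],
    1: [['2', '25.34'], None],   # None stands for a shorter list that was padded
})

long = wide.melt(
    id_vars=[c for c in wide.columns if not isinstance(c, int)],
    value_vars=[c for c in wide.columns if isinstance(c, int)],
).dropna()

print(long.columns.tolist())
print(len(long))
```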
Merge the melted frame back with the original and keep only the relevant columns:
final = pd.merge(df, melted)[['sports_name', 'player_name', 'player_country', 'value']]
>>> final.head()
   sports_name player_name player_country       value
0     football         XYZ             US  [1, 62.58]
1     football         XYZ             US  [2, 25.34]
2     football         XYZ             US  [3, 88.35]
3     football         XYZ             US  [4, 59.39]
4     football         ABC             US  [1, 56.61]
Now just drop the bad rows:
final = final[~final.value.astype(str).str.contains(r'-|NA')]
final.head()
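One caveat with the pattern `-|NA` is that it matches any pair whose text contains a hyphen or the letters NA, so a negative average would also be dropped. A possible alternative, which is not part of the answer, is to coerce the average to a number with `pd.to_numeric(..., errors='coerce')` and keep only the rows that convert. The series `pairs` and the value '-5.0' below are made up for illustration.

```python
import pandas as pd

# Hypothetical pairs: two valid, two placeholders ('NA' and '-').
pairs = pd.Series([['1', '62.58'], ['3', 'NA'], ['3', '-'], ['4', '-5.0']])

# Take the second element of each pair and turn non-numeric text into NaN.
averages = pd.to_numeric(pairs.map(lambda p: p[1]), errors='coerce')

# Keep only the pairs whose average parsed as a number.
kept = pairs[averages.notna()]
print(kept.tolist())
```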
And split the last column:
>>> pd.concat([
...     final,
...     pd.DataFrame(final.value.values.tolist(), index=final.index,
...                  columns=['player_match', 'player_average'])],
...     axis=1).drop('value', axis=1)
   sports_name player_name player_country player_match player_average
0     football         XYZ             US            1          62.58
1     football         XYZ             US            2          25.34
2     football         XYZ             US            3          88.35
3     football         XYZ             US            4          59.39
4     football         ABC             US            1          56.61
5     football         ABC             US            2          52.63
7     football         ABC             US            4          44.32
8     football         ABC             US            5          39.69
9      cricket         PQR            IND            1          98.73
10     cricket         PQR            IND            2          72.62
11     cricket         PQR            IND            3          71.53
12     cricket         PQR            IND            4          73.72
13     cricket         LMN            IND            1          72.52
14     cricket         LMN            IND            2          71.82
16     cricket         LMN            IND            4          62.72
17     cricket         LMN            IND            5          73.83
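Given the size mentioned in the question (~2,000 pairs per cell), the wide intermediate frame above would have roughly one column per pair. As an alternative sketch, not the approach taken in this answer, `DataFrame.explode` (available since pandas 0.25) goes straight from lists to one row per pair. It has not been benchmarked here, and the inline CSV is only the first two sample rows from the question.

```python
import ast
import io
import pandas as pd

csv_text = """sports_name,player_name,player_country,player_average
football,XYZ,US,"[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]"
football,ABC,US,"[['1', '56.61'], ['2', '52.63'],['3', 'NA'],['4', '44.32'],['5', '39.69']]"
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parse each cell into a list of [match, average] pairs, then emit one row per pair.
df['player_average'] = df['player_average'].apply(ast.literal_eval)
long = df.explode('player_average').reset_index(drop=True)

# Split each pair into its two components.
pairs = pd.DataFrame(long['player_average'].tolist(),
                     columns=['player_match', 'player_average'])
long = pd.concat([long.drop(columns='player_average'), pairs], axis=1)

# Coerce averages to numbers; 'NA' and '-' become NaN and are dropped.
long['player_average'] = pd.to_numeric(long['player_average'], errors='coerce')
long = long.dropna(subset=['player_average']).reset_index(drop=True)
long['player_match'] = long['player_match'].astype(int)

print(long)
```

Unlike the string-based filter, this version leaves `player_match` and `player_average` as numeric columns, which is usually what later aggregation needs.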