Splitting a pandas DataFrame column

Time: 2020-06-06 13:21:00

Tags: python pandas dataframe data-science

I have a df with an "Innings" column that contains the following data:

 Innings
[{'InningID': 297503, 'GameID': 47547, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297504, 'GameID': 47547, 'InningNumber': 2, 'AwayTeamRuns': 2, 'HomeTeamRuns': 2}, {'InningID': 297505, 'GameID': 47547, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297506, 'GameID': 47547, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297507, 'GameID': 47547, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297508, 'GameID': 47547, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297509, 'GameID': 47547, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297510, 'GameID': 47547, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297512, 'GameID': 47547, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': None}]
[{'InningID': 297511, 'GameID': 47546, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297513, 'GameID': 47546, 'InningNumber': 2, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297514, 'GameID': 47546, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297515, 'GameID': 47546, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297516, 'GameID': 47546, 'InningNumber': 5, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297517, 'GameID': 47546, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 3}, {'InningID': 297518, 'GameID': 47546, 'InningNumber': 7, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297519, 'GameID': 47546, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297520, 'GameID': 47546, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': 2}]
[{'InningID': 297521, 'GameID': 50022, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297522, 'GameID': 50022, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297523, 'GameID': 50022, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297524, 'GameID': 50022, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297525, 'GameID': 50022, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297526, 'GameID': 50022, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297527, 'GameID': 50022, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297528, 'GameID': 50022, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297529, 'GameID': 50022, 'InningNumber': 9, 'AwayTeamRuns': 3, 'HomeTeamRuns': 1}]
[{'InningID': 297530, 'GameID': 47556, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297532, 'GameID': 47556, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297534, 'GameID': 47556, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297536, 'GameID': 47556, 'InningNumber': 4, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297542, 'GameID': 47556, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297545, 'GameID': 47556, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297549, 'GameID': 47556, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297554, 'GameID': 47556, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297558, 'GameID': 47556, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297531, 'GameID': 47557, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297533, 'GameID': 47557, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297535, 'GameID': 47557, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297537, 'GameID': 47557, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297543, 'GameID': 47557, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297546, 'GameID': 47557, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297551, 'GameID': 47557, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 6}, {'InningID': 297560, 'GameID': 47557, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297566, 'GameID': 47557, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297539, 'GameID': 47549, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297540, 'GameID': 47549, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297544, 'GameID': 47549, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297548, 'GameID': 47549, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297553, 'GameID': 47549, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297557, 'GameID': 47549, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297561, 'GameID': 47549, 'InningNumber': 7, 'AwayTeamRuns': 3, 'HomeTeamRuns': 0}, {'InningID': 297573, 'GameID': 47549, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297579, 'GameID': 47549, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]
[{'InningID': 297538, 'GameID': 47558, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297541, 'GameID': 47558, 'InningNumber': 2, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297547, 'GameID': 47558, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297552, 'GameID': 47558, 'InningNumber': 4, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297556, 'GameID': 47558, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297562, 'GameID': 47558, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297568, 'GameID': 47558, 'InningNumber': 7, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297577, 'GameID': 47558, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297584, 'GameID': 47558, 'InningNumber': 9, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}]
[{'InningID': 297550, 'GameID': 47548, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297555, 'GameID': 47548, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297559, 'GameID': 47548, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}, {'InningID': 297567, 'GameID': 47548, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297569, 'GameID': 47548, 'InningNumber': 5, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297578, 'GameID': 47548, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297583, 'GameID': 47548, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297587, 'GameID': 47548, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297591, 'GameID': 47548, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297594, 'GameID': 47548, 'InningNumber': 10, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297597, 'GameID': 47548, 'InningNumber': 11, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}]
[]
[{'InningID': 297564, 'GameID': 47553, 'InningNumber': 1, 'AwayTeamRuns': 2, 'HomeTeamRuns': 0}, {'InningID': 297572, 'GameID': 47553, 'InningNumber': 2, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297574, 'GameID': 47553, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297580, 'GameID': 47553, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297586, 'GameID': 47553, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297590, 'GameID': 47553, 'InningNumber': 6, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297596, 'GameID': 47553, 'InningNumber': 7, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297599, 'GameID': 47553, 'InningNumber': 8, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297603, 'GameID': 47553, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': 2}]
[{'InningID': 297563, 'GameID': 47555, 'InningNumber': 1, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297570, 'GameID': 47555, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': 1}, {'InningID': 297575, 'GameID': 47555, 'InningNumber': 3, 'AwayTeamRuns': 0, 'HomeTeamRuns': 5}, {'InningID': 297582, 'GameID': 47555, 'InningNumber': 4, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297588, 'GameID': 47555, 'InningNumber': 5, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, {'InningID': 297593, 'GameID': 47555, 'InningNumber': 6, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, {'InningID': 297595, 'GameID': 47555, 'InningNumber': 7, 'AwayTeamRuns': 1, 'HomeTeamRuns': 0}, {'InningID': 297598, 'GameID': 47555, 'InningNumber': 8, 'AwayTeamRuns': 1, 'HomeTeamRuns': 2}, {'InningID': 297601, 'GameID': 47555, 'InningNumber': 9, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]

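I need to split the Innings column into many columns (the number of innings can differ from row to row...). I tried the explode function, but because the number of fields varies, it did not give me the split I need.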

Ideally, new columns would be created, for example:

InningID1 : xxx, InningNumber1: x, AwayTeamsRuns1: x, HomeTeamRuns1: x, InningID2 : xxx, InningNumber2: x, AwayTeamsRuns2: x, HomeTeamRuns2: x etc...

I don't care about GameID, because I already have that information in the same row of the df.

Any suggestions?

Thanks a lot

Jeffrey

2 answers:

Answer 0 (score: 1)

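You can try this: first clean the data by replacing ' with " and converting None to "None", so that each cell becomes a valid JSON string. Then iterate over the DataFrame, convert each JSON string to a list of dicts with json.loads, use json_normalize to turn it into a DataFrame, and finally concat the pieces. The result has one row per inning.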

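import json
import pandas as pd

# Make each cell valid JSON: single -> double quotes, None -> "None"
df['Innings'] = df['Innings'].str.replace("'", '"', regex=False)
df['Innings'] = df['Innings'].str.replace("None", '"None"', regex=False)
dfs = []
for i in range(df.shape[0]):
    j_str = df.loc[i, 'Innings']
    x = json.loads(j_str)
    # One row per inning of this game
    dfs.append(pd.json_normalize(x))

df = pd.concat(dfs).reset_index(drop=True)
print(df)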

Output:

    InningID  GameID  InningNumber  AwayTeamRuns HomeTeamRuns
0     297503   47547             1             0            3
1     297504   47547             2             2            2
2     297505   47547             3             0            2
3     297506   47547             4             0            0
4     297507   47547             5             0            0
..       ...     ...           ...           ...          ...
87    297588   47555             5             0            3
88    297593   47555             6             0            0
89    297595   47555             7             1            0
90    297598   47555             8             1            2
91    297601   47555             9             0         None
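The output above is in long form (one row per inning), while the question asks for one row per game with columns such as InningID1, AwayTeamRuns1, and so on. A minimal sketch of one way to get there follows. It assumes each Innings cell is a stringified Python list, that inning numbers are unique within a game, and it swaps in ast.literal_eval for the quote replacement plus json.loads used above. The sample frame and the names parsed, long_df, wide and result are made up for illustration.

```python
import ast
import pandas as pd

# Hypothetical sample in the shape of the question's data: each cell is a
# stringified Python list of dicts; the middle game has no innings at all.
df = pd.DataFrame({
    "GameID": [47547, 11111, 47556],
    "Innings": [
        "[{'InningID': 297503, 'GameID': 47547, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 3}, "
        "{'InningID': 297504, 'GameID': 47547, 'InningNumber': 2, 'AwayTeamRuns': 2, 'HomeTeamRuns': 2}]",
        "[]",
        "[{'InningID': 297530, 'GameID': 47556, 'InningNumber': 1, 'AwayTeamRuns': 0, 'HomeTeamRuns': 0}, "
        "{'InningID': 297532, 'GameID': 47556, 'InningNumber': 2, 'AwayTeamRuns': 0, 'HomeTeamRuns': None}]",
    ],
})

# literal_eval understands single quotes and None, so no string cleanup is needed.
parsed = df["Innings"].apply(ast.literal_eval)

# One element per inning; the index still carries the original row label.
# Rows with an empty list become NaN and are dropped here.
long = parsed.explode().dropna()
long_df = pd.DataFrame(long.tolist(), index=long.index).drop(columns="GameID")

# Pivot: rows = original row label, columns = (field, InningNumber).
wide = long_df.set_index("InningNumber", append=True).unstack("InningNumber")

# Group the columns by inning (sorted() is stable, so field order is kept),
# then flatten ("InningID", 1) into "InningID1".
cols = sorted(wide.columns, key=lambda c: c[1])
wide = wide[cols]
wide.columns = [f"{field}{num}" for field, num in wide.columns]

# Left join back onto the per-game columns; games without innings get NaN.
result = df.drop(columns="Innings").join(wide)
print(result)
```

Games with fewer innings than the longest game end up with NaN in the trailing columns, and InningNumber itself is absorbed into the column suffix rather than kept as its own column.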

Answer 1 (score: 0)

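This looks like a great opportunity to use DictVectorizer:

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

vectorizer = DictVectorizer()
# explode() yields NaN for rows with an empty list; drop them before vectorizing
dicts = df.explode("Innings")["Innings"].dropna()
# toarray() returns a plain ndarray (todense() would return an np.matrix)
data = vectorizer.fit_transform(dicts).toarray()
innings_df = pd.DataFrame(data, columns=vectorizer.feature_names_)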

Missing keys are automatically filled with zeros; make sure this is the behavior you want.
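To make the zero-filling concrete, here is a small sketch with two made-up inning records, the second of which has no HomeTeamRuns key. The records list is hypothetical, and sparse=False is used only so the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Hypothetical inning records: the second one has no "HomeTeamRuns" key at all.
records = [
    {"AwayTeamRuns": 1, "HomeTeamRuns": 2},
    {"AwayTeamRuns": 3},
]

# sparse=False returns a dense NumPy array directly.
vectorizer = DictVectorizer(sparse=False)
data = vectorizer.fit_transform(records)

# feature_names_ lists the keys seen during fitting, in column order.
innings_df = pd.DataFrame(data, columns=vectorizer.feature_names_)
print(innings_df)
```

The missing HomeTeamRuns comes back as 0.0, which cannot be told apart from a half inning in which no runs were scored. If that distinction matters, building the frame with pd.DataFrame(records) instead leaves the missing key as NaN.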

Now you can merge or join innings_df with the original DataFrame (df) as needed.
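One possible way to do that join, sketched under the assumption that the Innings cells hold real lists of dicts: explode() repeats the original row label for every inning, so that label can serve as the link back to df. The sample frame and the name merged are made up for illustration, and pd.DataFrame is used here in place of DictVectorizer so the index can be passed through.

```python
import pandas as pd

# Hypothetical stand-in for the original frame; "Innings" holds lists of dicts.
df = pd.DataFrame({
    "GameID": [47547, 47546],
    "Innings": [
        [{"GameID": 47547, "InningNumber": 1, "AwayTeamRuns": 0, "HomeTeamRuns": 3}],
        [{"GameID": 47546, "InningNumber": 1, "AwayTeamRuns": 0, "HomeTeamRuns": 0},
         {"GameID": 47546, "InningNumber": 2, "AwayTeamRuns": 1, "HomeTeamRuns": 0}],
    ],
})

# explode() repeats the original row label once per inning of that row.
dicts = df["Innings"].explode().dropna()

# One row per inning, reusing the exploded index as the link back to df.
innings_df = pd.DataFrame(dicts.tolist(), index=dicts.index)

# GameID already lives in df, so drop the duplicate to avoid a column clash.
innings_df = innings_df.drop(columns="GameID", errors="ignore")

# Index-based join: each inning row picks up the columns of its game.
merged = df.drop(columns="Innings").join(innings_df, how="inner")
print(merged)
```

With how="inner", games that have no innings drop out of the result; a left join would keep them with NaN in the inning columns.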