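How do I split a pandas DataFrame column into multiple rows based on a condition?

Time: 2018-05-09 06:39:15

Tags: python pandas dataframe

I am trying to split one column of a pandas DataFrame into multiple rows.

DATA: the input DataFrame looks like this: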

sports_name,player_name,player_country,player_average
football,XYZ,US,"[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]"
football,ABC,US,"[['1', '56.61'], ['2', '52.63'],['3', 'NA'],['4', '44.32'],['5', '39.69']]"
cricket,PQR,IND,"[['1', '98.73'], ['2', '72.62'],['3', '71.53'],['4', '73.72']]"
cricket,LMN,IND,"[['1', '72.52'], ['2', '71.82'],['3', '-'],['4', '62.72'],['5', '73.83']]"

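Notes on the data:

  1. The column that needs to be split into multiple rows is player_average.
  2. The "player_average" column holds string values; each string is a list of lists.
  3. Each inner list always contains two values: the first is "player_match" and the second is "player_average".
  4. The "player_average" value may be "NA", "-", or some other non-numeric text.
  5. Requirements:

    1. "minimum_average" is an integer value.
    2. For each player, I want only the matches whose average is greater than "minimum_average".
    3. Output: the output DataFrame should look like this: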

      sports_name,player_name,player_country,player_match,player_average
      football,XYZ,US,1,62.58
      football,XYZ,US,3,88.35
      football,XYZ,US,4,59.39
      football,ABC,US,1,56.61
      football,ABC,US,2,52.63
      cricket,PQR,IND,1,98.73
      cricket,PQR,IND,2,72.62
      cricket,PQR,IND,3,71.53
      cricket,PQR,IND,4,73.72
      cricket,LMN,IND,1,72.52
      cricket,LMN,IND,2,71.82
      cricket,LMN,IND,4,62.72
      cricket,LMN,IND,5,73.83
      

      Edit:

      Note that the data is very large. Each "player_average" value may contain ~2,000 arrays, and there are ~10,00,000 (one million) rows.

1 Answer:

Answer 0: (score: 1)

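Suppose you start by parsing each string into a list and spreading the pairs across columns:

import ast
import pandas as pd

# Parse each string into a list of [match, average] pairs, one pair per new column.
as_lists = pd.concat(
    [df, pd.DataFrame(df.player_average.apply(ast.literal_eval).tolist())],
    axis=1).drop('player_average', axis=1)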
>>> as_lists
    sports_name player_name player_country  0   1   2   3   4
0   football    XYZ US  [1, 62.58]  [2, 25.34]  [3, 88.35]  [4, 59.39]  None
1   football    ABC US  [1, 56.61]  [2, 52.63]  [3, NA] [4, 44.32]  [5, 39.69]
2   cricket PQR IND [1, 98.73]  [2, 72.62]  [3, 71.53]  [4, 73.72]  None
3   cricket LMN IND [1, 72.52]  [2, 71.82]  [3, -]  [4, 62.72]  [5, 73.83]
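
To reproduce the parsing step, the sample rows from the question can be loaded from an in-memory CSV. This is only a sketch: the names `csv_text` and `parsed` are made up for illustration, and the real data is assumed to have the same four columns.

```python
import ast
import io

import pandas as pd

# Sample rows copied from the question.
csv_text = '''sports_name,player_name,player_country,player_average
football,XYZ,US,"[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]"
football,ABC,US,"[['1', '56.61'], ['2', '52.63'],['3', 'NA'],['4', '44.32'],['5', '39.69']]"
cricket,PQR,IND,"[['1', '98.73'], ['2', '72.62'],['3', '71.53'],['4', '73.72']]"
cricket,LMN,IND,"[['1', '72.52'], ['2', '71.82'],['3', '-'],['4', '62.72'],['5', '73.83']]"
'''
df = pd.read_csv(io.StringIO(csv_text))

# Each cell is still a string here; literal_eval turns it into
# a list of [match, average] pairs whose elements are strings.
parsed = df.player_average.apply(ast.literal_eval)
```

Note that the parsed elements are still strings such as '62.58', even though the printed frame above shows them without quotes.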

Now melt it, using whether a column name is an integer to tell the value columns from the id columns:

melted = as_lists.melt(
    id_vars=[c for c in as_lists.columns if not isinstance(c, int)], 
    value_vars=[c for c in as_lists.columns if isinstance(c, int)]).dropna()

Merge the result back with the original frame and keep only the identifying columns plus `value`:

final = pd.merge(df, melted)[['sports_name', 'player_name', 'player_country', 'value']]
>>> final.head()
    sports_name player_name player_country  value
0   football    XYZ US  [1, 62.58]
1   football    XYZ US  [2, 25.34]
2   football    XYZ US  [3, 88.35]
3   football    XYZ US  [4, 59.39]
4   football    ABC US  [1, 56.61]

Now drop just the bad rows:

final = final[~final.value.astype(str).str.contains(r'-|NA')]

final.head()
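
The regex filter above removes any pair whose text contains "-" or "NA", so it would also drop a negative average and would miss other non-numeric placeholders. A possible alternative, sketched here on a made-up miniature of the `value` column, is to coerce the average with `pd.to_numeric(..., errors='coerce')` and keep the rows that parsed. The names `value`, `average`, and `kept` are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the `value` column: [match, average] pairs as strings.
value = pd.Series([['1', '62.58'], ['3', 'NA'], ['3', '-'], ['4', '-5.5']])

# Take the second element of each pair and coerce it; non-numeric text becomes NaN.
average = pd.to_numeric(value.str[1], errors='coerce')

# Keep only the pairs whose average parsed as a number.
kept = value[average.notna()]
```

Here the "NA" and "-" pairs are dropped, while the negative value survives because it is a valid number.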

And split the last column:

>>> pd.concat([
    final, 
    pd.DataFrame(final.value.values.tolist(), index=final.index, columns=['player_match', 'player_average'])],
axis=1).drop('value', axis=1)
    sports_name player_name player_country  player_match    player_average
0   football    XYZ US  1   62.58
1   football    XYZ US  2   25.34
2   football    XYZ US  3   88.35
3   football    XYZ US  4   59.39
4   football    ABC US  1   56.61
5   football    ABC US  2   52.63
7   football    ABC US  4   44.32
8   football    ABC US  5   39.69
9   cricket PQR IND 1   98.73
10  cricket PQR IND 2   72.62
11  cricket PQR IND 3   71.53
12  cricket PQR IND 4   73.72
13  cricket LMN IND 1   72.52
14  cricket LMN IND 2   71.82
16  cricket LMN IND 4   62.72
17  cricket LMN IND 5   73.83
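
The result above still contains every numeric average, so the "minimum_average" requirement from the question has yet to be applied. The sketch below shows one way the whole task might be done end to end. It swaps the concat/melt approach for `DataFrame.explode`, which assumes pandas 0.25 or later and avoids building a frame with one column per pair (roughly 2,000 columns for the data size described in the question). The threshold of 50 is an assumption; the question does not state a value, but 50 is consistent with its expected output.

```python
import ast

import pandas as pd

# Two rows from the question; the real frame is assumed to have the same layout.
df = pd.DataFrame({
    'sports_name': ['football', 'football'],
    'player_name': ['XYZ', 'ABC'],
    'player_country': ['US', 'US'],
    'player_average': [
        "[['1', '62.58'], ['2', '25.34'],['3', '88.35'],['4', '59.39']]",
        "[['1', '56.61'], ['2', '52.63'],['3', 'NA'],['4', '44.32'],['5', '39.69']]",
    ],
})
minimum_average = 50  # assumed threshold; not given in the question

# Parse each string into a list of pairs, then emit one row per pair.
long = (df.assign(pair=df.player_average.apply(ast.literal_eval))
          .drop(columns='player_average')
          .explode('pair')
          .reset_index(drop=True))

# Split each pair into its two fields; unparseable averages become NaN.
long['player_match'] = long.pair.str[0].astype(int)
long['player_average'] = pd.to_numeric(long.pair.str[1], errors='coerce')

# NaN > minimum_average is False, so "NA" and "-" rows drop out here as well.
result = (long[long.player_average > minimum_average]
          .drop(columns='pair')
          .reset_index(drop=True))
```

For these two players, `result` keeps matches 1, 3, and 4 for XYZ and matches 1 and 2 for ABC, which agrees with the expected output in the question.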