如何将Dataframe单元格中的列表分解为单独的行

时间:2015-09-08 22:43:05

标签: python pandas dataframe

我希望将包含列表的pandas单元格转换为每个值的行。

所以,拿这个:

enter image description here

如果我想要解压缩并堆叠nearest_neighbors列中的值,以便每个值都是每个opponent索引中的一行,那么我最好怎么做呢?是否有适合此类操作的pandas方法?

10 个答案:

答案 0 :(得分:28)

使用apply(pd.Series)stack,然后使用reset_indexto_frame

In [1803]: (df.nearest_neighbors.apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame('nearest_neighbors'))
Out[1803]:
                    nearest_neighbors
name       opponent
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

详细

In [1804]: df
Out[1804]:
                                                   nearest_neighbors
name       opponent
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]

答案 1 :(得分:15)

我认为这是一个非常好的问题,在Hive中你会使用EXPLODE,我认为有一种情况可以说Pandas默认应该包含这个功能。我可能会使用嵌套的生成器理解来爆炸列表列,如下所示:

pd.DataFrame({
    "name": i[0],
    "opponent": i[1],
    "nearest_neighbor": neighbour
    }
    for i, row in df.iterrows() for neighbour in row.nearest_neighbors
    ).set_index(["name", "opponent"])

答案 2 :(得分:9)

到目前为止,我发现的最快方法是使用.iloc扩展DataFrame并返回展平目标列。

考虑到通常的输入(复制了一下):

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
df = pd.concat([df]*10)

df
Out[3]: 
                                                   nearest_neighbors
name       opponent                                                 
A.J. Price 76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           bobcats   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           76ers     [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
           blazers   [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
...

鉴于以下建议的替代方案:

col_target = 'nearest_neighbors'

def extend_iloc():
    # Flatten columns of lists
    col_flat = [item for sublist in df[col_target] for item in sublist] 
    # Row numbers to repeat 
    lens = df[col_target].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    cols = [i for i,c in enumerate(df.columns) if c != col_target]
    new_df = df.iloc[ilocations, cols].copy()
    new_df[col_target] = col_flat
    return new_df

def melt():
    return (pd.melt(df[col_target].apply(pd.Series).reset_index(), 
             id_vars=['name', 'opponent'],
             value_name=col_target)
            .set_index(['name', 'opponent'])
            .drop('variable', axis=1)
            .dropna()
            .sort_index())

def stack_unstack():
    return (df[col_target].apply(pd.Series)
            .stack()
            .reset_index(level=2, drop=True)
            .to_frame(col_target))

我发现extend_iloc()最快的

%timeit extend_iloc()
3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit melt()
22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit stack_unstack()
11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

答案 3 :(得分:7)

使用apply(pd.Series)更好的替代解决方案:

df = pd.DataFrame({'listcol':[[1,2,3],[4,5,6]]})

# expand df.listcol into its own dataframe
tags = df['listcol'].apply(pd.Series)

# rename each variable is listcol
tags = tags.rename(columns = lambda x : 'listcol_' + str(x))

# join the tags dataframe back to the original dataframe
df = pd.concat([df[:], tags[:]], axis=1)

答案 4 :(得分:6)

与Hive的EXPLODE功能类似:

import copy

def pandas_explode(df, column_to_explode):
    """
    Similar to Hive's EXPLODE function, take a column with iterable elements, and flatten the iterable to one element 
    per observation in the output table

    :param df: A dataframe to explod
    :type df: pandas.DataFrame
    :param column_to_explode: 
    :type column_to_explode: str
    :return: An exploded data frame
    :rtype: pandas.DataFrame
    """

    # Create a list of new observations
    new_observations = list()

    # Iterate through existing observations
    for row in df.to_dict(orient='records'):

        # Take out the exploding iterable
        explode_values = row[column_to_explode]
        del row[column_to_explode]

        # Create a new observation for every entry in the exploding iterable & add all of the other columns
        for explode_value in explode_values:

            # Deep copy existing observation
            new_observation = copy.deepcopy(row)

            # Add one (newly flattened) value from exploding iterable
            new_observation[column_to_explode] = explode_value

            # Add to the list of new observations
            new_observations.append(new_observation)

    # Create a DataFrame
    return_df = pandas.DataFrame(new_observations)

    # Return
    return return_df

答案 5 :(得分:3)

simplified significantly in pandas 0.25爆炸了类似列表的列,并增加了 explode()方法:

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                    'opponent': ['76ers', 'blazers', 'bobcats'], 
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))

df.explode('nearest_neighbors')

出局:

                    nearest_neighbors
name       opponent                  
A.J. Price 76ers          Zach LaVine
           76ers           Jeremy Lin
           76ers        Nate Robinson
           76ers                Isaia
           blazers        Zach LaVine
           blazers         Jeremy Lin
           blazers      Nate Robinson
           blazers              Isaia
           bobcats        Zach LaVine
           bobcats         Jeremy Lin
           bobcats      Nate Robinson
           bobcats              Isaia

答案 6 :(得分:2)

所有这些答案都很好,但是我想要一些“非常简单”的东西,所以这是我的贡献:

def explode(series):
    return pd.Series([x for _list in series for x in _list])                               

就是这样。仅当您需要一个新的列表被“分解”的系列时,才使用它。这是一个示例,其中我们对taco选择进行value_counts():

In [1]: my_df = pd.DataFrame(pd.Series([['a','b','c'],['b','c'],['c']]), columns=['tacos'])      
In [2]: my_df.head()                                                                               
Out[2]: 
   tacos
0  [a, b, c]
1     [b, c]
2        [c]

In [3]: explode(my_df['tacos']).value_counts()                                                     
Out[3]: 
c    3
b    2
a    1

答案 7 :(得分:1)

这是对较大数据帧的潜在优化。当"爆炸"中有几个相等的值时,这会运行得更快。领域。 (数据帧与字段中的唯一值计数相比越大,此代码执行得越好。)

def lateral_explode(dataframe, fieldname): 
    temp_fieldname = fieldname + '_made_tuple_' 
    dataframe[temp_fieldname] = dataframe[fieldname].apply(tuple)       
    list_of_dataframes = []
    for values in dataframe[temp_fieldname].unique().tolist(): 
        list_of_dataframes.append(pd.DataFrame({
            temp_fieldname: [values] * len(values), 
            fieldname: list(values), 
        }))
    dataframe = dataframe[list(set(dataframe.columns) - set([fieldname]))]\ 
        .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname) 
    del dataframe[temp_fieldname]

    return dataframe

答案 8 :(得分:1)

扩展Oleg的.iloc答案以自动展平所有列表列:

def extend_iloc(df):
    cols_to_flatten = [colname for colname in df.columns if 
    isinstance(df.iloc[0][colname], list)]
    # Row numbers to repeat 
    lens = df[cols_to_flatten[0]].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
    col_idxs = list(zip(*with_idxs)[0])
    new_df = df.iloc[ilocations, col_idxs].copy()

    # Flatten columns of lists
    for col_target in cols_to_flatten:
        col_flat = [item for sublist in df[col_target] for item in sublist]
        new_df[col_target] = col_flat

    return new_df

这假设每个列表列具有相同的列表长度。

答案 9 :(得分:1)

您可以使列变平,而不是使用apply(pd.Series)。这样可以提高性能。

df = (pd.DataFrame({'name': ['A.J. Price'] * 3, 
                'opponent': ['76ers', 'blazers', 'bobcats'], 
                'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
  .set_index(['name', 'opponent']))



%timeit (pd.DataFrame(df['nearest_neighbors'].values.tolist(), index = df.index)
           .stack()
           .reset_index(level = 2, drop=True).to_frame('nearest_neighbors'))

1.87 ms ± 9.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


%timeit (df.nearest_neighbors.apply(pd.Series)
          .stack()
          .reset_index(level=2, drop=True)
          .to_frame('nearest_neighbors'))

2.73 ms ± 16.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)