Question

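I wrote this code so that I can group any Pandas DataFrame and quickly get the group sizes along with one sample row per group.

It works well, with one problem: the name of the new column/index, "Size", is fixed, because the .assign( ... ) command does not accept a variable for the name. So if my DataFrame already has a column named "Size", that column is lost.

My plan is to check whether a column named "Size" exists and, if it does, use a different name for the index. Can I use the assign command with a variable for the field name instead of fixed text?

I would like to avoid hacky solutions such as renaming columns multiple times.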

import pandas as pd
try:
    from pandas.api.extensions import register_dataframe_accessor
except ImportError:
    raise ImportError('Pandas 0.24 or better needed')

@register_dataframe_accessor("cgrp")
class CustomGrouper:
    """Extra methods for dataframes."""

    def __init__(self, df):
        self._df = df

    def group_sample(self, by, subset=None):
        result = (self._df.groupby(by).apply(lambda x: x.sample(1).assign(Size = len(x)))).set_index('Size').sort_index(ascending=False)
        return result

I can call it like this

df.cgrp.group_sample(by=['column1', ... ])

and get a result indexed by "Size".

Answer 1

The basic idea is to use dictionary unpacking. Instead of hard-coding the name in the assign call:

.assign(Size = len(x))

you can use dictionary unpacking to take the column name from a variable:

.assign(**{col_name: len(x)})
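As a small self-contained illustration (not part of the original code), the sketch below compares the two forms on a toy DataFrame. The data and the name "GroupSize" are made up for the example.

```python
import pandas as pd

# Toy data, invented for this illustration only.
df = pd.DataFrame({"A": [1, 1, 2], "B": [10, 20, 30]})

col_name = "GroupSize"  # hypothetical name chosen at runtime

# Keyword form: the column name "Size" is fixed in the source code.
fixed = df.assign(Size=len(df))

# Dictionary unpacking: the column name comes from the variable.
dynamic = df.assign(**{col_name: len(df)})

print(list(fixed.columns))    # ['A', 'B', 'Size']
print(list(dynamic.columns))  # ['A', 'B', 'GroupSize']
```

This works because assign takes its new columns as keyword arguments, and ** turns each dictionary key into a keyword name.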

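I took the liberty of modifying your group_sample function to give it two features: it lets the user specify a custom name and, if none is given, picks one from a list of defaults: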

def group_sample(self, by, subset=None, col_name=None):
    _col_name = None

    if col_name is not None:
        # If the user specifies a column name, use it
        # Raise error if the column already exists
        if col_name in self._df.columns:
            raise ValueError(f"Dataframe already has column '{col_name}'")
        else:
            _col_name = col_name
    else:
        # Choose from a list of default names
        _col_name = next((name for name in ['Size', 'Size_', 'Size__'] if name not in self._df.columns), None)

        if _col_name is None:
            raise ValueError('Cannot determine a default name for the size column. Please specify one manually')

    result = (self._df.groupby(by).apply(lambda x: x.sample(1).assign(**{_col_name: len(x)}))).set_index(_col_name).sort_index(ascending=False)
    return result

Usage:

import numpy as np

df1 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df1.cgrp.group_sample(by=['A'])     # the column name is Size

df2 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','Size'])
df2.cgrp.group_sample(by=['A'])     # the column name is Size_

df3 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df3.cgrp.group_sample(by=['A'], col_name='B')  # error, B already exists

df4 = pd.DataFrame(np.random.randint(1, 5, (3, 2)), columns=['A','B'])
df4.cgrp.group_sample(by=['A'], col_name='MySize')  # custom column name
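The fixed list of three default names raises an error once all of them are taken. One possible variation, sketched here under the assumption that appending underscores is an acceptable naming scheme, keeps extending the name until it is free. The helper pick_free_name is hypothetical and is not part of pandas or of the code above.

```python
import pandas as pd

def pick_free_name(columns, base="Size"):
    """Hypothetical helper: return `base`, extended with underscores
    until it does not clash with any existing column label."""
    name = base
    while name in columns:
        name += "_"
    return name

# Toy frame in which both 'Size' and 'Size_' are already taken.
df = pd.DataFrame({"A": [1, 2], "Size": [3, 4], "Size_": [5, 6]})

free_name = pick_free_name(df.columns)
print(free_name)  # Size__

# The chosen name can then be passed to assign via dictionary unpacking.
out = df.assign(**{free_name: len(df)})
print(list(out.columns))  # ['A', 'Size', 'Size_', 'Size__']
```

Inside group_sample, the result of such a helper could stand in for the next(...) expression that selects _col_name.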
