Iterate over rows and assign a value based on a condition

Time: 2019-05-31 07:50:51

Tags: python python-3.x pandas data-cleaning

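Every row in my dataframe has a date, and I want to assign a value to a new column based on a condition on that date.

Normally, when I assign a value to a new column, I do something like this: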

def get_mean(df):
    return df.assign(
        grouped_mean=lambda df: df.groupby('group')['X'].transform(lambda s: s.mean())
    )


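I'm looking for a solution along these lines, because the one I have now is very slow and not pretty.

Is there a better approach than my current solution, ideally one that uses assign?

So far I've come up with the following: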

def set_season(df):
    df = df.copy()
    for i in df.index:
        if (df.loc[i, 'Date'] >= pd.Timestamp('2008-08-30')) & (df.loc[i, 'Date'] <= pd.Timestamp('2009-05-31')):
            df.at[i, 'season'] = '08-09'
        elif  (df.loc[i, 'Date'] >= pd.Timestamp('2009-08-22')) & (df.loc[i, 'Date'] <= pd.Timestamp('2010-05-16')):
            df.at[i, 'season'] = '09-10'
        elif  (df.loc[i, 'Date'] >= pd.Timestamp('2010-08-28')) & (df.loc[i, 'Date'] <= pd.Timestamp('2011-05-22')):
            df.at[i, 'season'] = '10-11'

    return df

2 answers:

Answer 0: (score: 3)

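In pandas, and in Python more generally, we want to avoid looping over our data where possible, since it can be up to 1000 times slower. For most problems, pandas and NumPy provide vectorized solutions. The original answer linked to further reading on this.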
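As a rough illustration of the difference, the sketch below computes the same boolean flag twice: once with a Python-level loop over the values and once with a single vectorized comparison. The sample dates and the `cutoff` value are made up for the example and are not taken from the question.

```python
import pandas as pd

# Hypothetical sample data, not from the question
df = pd.DataFrame({'Date': pd.to_datetime(['2008-09-01', '2009-07-01', '2010-01-15'])})
cutoff = pd.Timestamp('2009-01-01')

# Looping: one Python-level comparison per row
looped = [date >= cutoff for date in df['Date']]

# Vectorized: one comparison applied to the whole column at once
vectorized = df['Date'] >= cutoff

print(looped)
print(vectorized.tolist())
```

Both forms give the same answer; the vectorized one hands the work to pandas and NumPy instead of running the comparison row by row in Python, which is where the speed difference comes from on larger frames.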

In your case, we can use np.select: define a list of conditions, then a list of choices that are picked according to those conditions.

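Additionally, Series.between makes the code tidier. Both endpoints are included by default; on pandas 1.3 and later this can be spelled out with inclusive='both' (older versions used inclusive=True, which was later deprecated and then removed in pandas 2.0).

import numpy as np

conditions = [
    df['Date'].between('2008-08-30', '2009-05-31', inclusive='both'),
    df['Date'].between('2009-08-22', '2010-05-16', inclusive='both'),
    df['Date'].between('2010-08-28', '2011-05-22', inclusive='both')
]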

choices = ['08-09', '09-10', '10-11']

df['season'] = np.select(conditions, choices, default='99-99')
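Since the question asks for something that works with assign, one possible way to package this is sketched below. It wraps the np.select call in a function that returns a new frame. The `sample` frame is invented for the demonstration, and the sketch relies on between including both endpoints by default, so the inclusive argument is left out.

```python
import numpy as np
import pandas as pd

def set_season(df):
    """Return a copy of df with a 'season' column derived from 'Date'."""
    date = df['Date']
    conditions = [
        date.between('2008-08-30', '2009-05-31'),
        date.between('2009-08-22', '2010-05-16'),
        date.between('2010-08-28', '2011-05-22'),
    ]
    choices = ['08-09', '09-10', '10-11']
    # assign returns a new frame, so the caller's df is left untouched
    return df.assign(season=np.select(conditions, choices, default='99-99'))

# Hypothetical sample: first day of a season, a summer gap, last day of a season
sample = pd.DataFrame({'Date': pd.to_datetime(['2008-08-30', '2009-07-01', '2011-05-22'])})
result = set_season(sample)
print(result)
```

Dates that fall outside every range get the default label '99-99', and because the function returns a new frame it can also be chained with df.pipe(set_season).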

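Side note

We can also rewrite your first function more cleanly by dropping the two lambda functions, assigning the result of groupby and transform directly to a new column, and accepting two extra arguments: group and mean_col.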

def get_mean(df, group, mean_col):
    # Note: this adds the 'mean' column to the df that was passed in
    df['mean'] = df.groupby(group)[mean_col].transform('mean')
    return df

Example

# Example dataframe
df = pd.DataFrame({'Fruit':['Banana', 'Strawberry', 'Apple', 'Banana', 'Apple'],
                   'Weight':[10, 12, 8, 9, 14]})

        Fruit  Weight
0      Banana      10
1  Strawberry      12
2       Apple       8
3      Banana       9
4       Apple      14

get_mean(df, 'Fruit', 'Weight')

        Fruit  Weight  mean
0      Banana      10   9.5
1  Strawberry      12  12.0
2       Apple       8  11.0
3      Banana       9   9.5
4       Apple      14  11.0
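If modifying the input frame is not wanted, the same calculation can be written with assign, which is closer to the style used in the question. This is only a sketch: the name `with_group_mean` is hypothetical, and the data is the fruit example from above.

```python
import pandas as pd

def with_group_mean(df, group, mean_col):
    # Hypothetical variant: returns a new frame instead of modifying df
    return df.assign(mean=df.groupby(group)[mean_col].transform('mean'))

fruit = pd.DataFrame({'Fruit': ['Banana', 'Strawberry', 'Apple', 'Banana', 'Apple'],
                      'Weight': [10, 12, 8, 9, 14]})

out = with_group_mean(fruit, 'Fruit', 'Weight')
print(out)
```

Here `fruit` keeps its original two columns, while `out` carries the extra 'mean' column.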

Answer 1: (score: 0)

If the new column 'season' depends on only one column, use the .apply() method:

def your_function(date):
    """
    takes a date and returns a season string
    """
    # code your function here

df['season'] = df['Date'].apply(your_function)
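For illustration, one way the placeholder function could be filled in is sketched below. The `SEASONS` table and the name `date_to_season` are hypothetical; the date ranges are the ones from the question, and the '99-99' fallback is borrowed from the other answer.

```python
import pandas as pd

# (start, end, label) tuples taken from the date ranges in the question
SEASONS = [
    (pd.Timestamp('2008-08-30'), pd.Timestamp('2009-05-31'), '08-09'),
    (pd.Timestamp('2009-08-22'), pd.Timestamp('2010-05-16'), '09-10'),
    (pd.Timestamp('2010-08-28'), pd.Timestamp('2011-05-22'), '10-11'),
]

def date_to_season(date):
    """Hypothetical helper: map one timestamp to its season label."""
    for start, end, label in SEASONS:
        if start <= date <= end:
            return label
    return '99-99'  # fallback for dates outside every season

# Hypothetical sample data
df = pd.DataFrame({'Date': pd.to_datetime(['2008-12-25', '2009-07-01', '2010-09-15'])})
df['season'] = df['Date'].apply(date_to_season)
print(df)
```

Note that apply still calls the function once per row, so on large frames it will generally be slower than a vectorized approach such as np.select.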

If the new column 'season' depends on several other columns, use apply on the dataframe with axis=1:

def your_function(row):
    """
    takes a row from your dataframe and returns a result
    """
    # code your function here
    # example if you want a sum of col1, col2, col3
    return row['col1'] + row['col2'] + row['col3']

df['season'] = df.apply(your_function, axis = 1)