Adding multiple columns to a pandas DataFrame using np.where

Time: 2019-12-28 00:35:33

Tags: python pandas numpy

I am trying to add multiple columns to a DataFrame using numpy.where() in my ETL logic.

Here is my df (posted as an image, not reproduced here).

I am trying to get my df to look like this (also posted as an image, not reproduced here).

The code is:

import numpy as np
import pandas as pd

current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [
                np.where(
                    # When old hash code is available and new hash code is not available. 0 -- N
                    (
                            df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                    ) |
                    # When hash codes are available and matched. 3.1 -- 'N'
                    (
                            ~df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                            &
                            ~(df['new_hash'].ne(df['old_hash']))
                    ),
                    ['N', df['cr_date'], df['up_date']],
                    np.where(
                        # When new hash code is available and old hash code is not available. 1 -- Y
                        (
                                ~df['new_hash'].isna()
                                &
                                df['old_hash'].isna()
                        ),
                        ['Y', current_time, current_time],
                        np.where(
                            # When hash codes are available and do not match. 3.2 -- 'Y'
                            (
                                    ~df['new_hash'].isna()
                                    &
                                    ~df['old_hash'].isna()
                                    &
                                    df['new_hash'].ne(df['old_hash'])
                            ),
                            ['Y', df['cr_date'], current_time],
                            ['N', df['cr_date'], df['up_date']]
                        )
                    )
                )
            ],
            index=df.index,
            columns=['is_changed', 'cr_date_new', 'up_date_new']
        )
    ],
    axis=1
)

I also tried the code above with df.join() instead of pd.concat(). It still gives me the ValueError shown below.

I can add one column at a time. For example:

df['is_changed'] = (
    np.where(
        # When old hash code is available and new hash code is not available. 0 -- N
        (
                df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
        ) |
        # When hash codes are available and matched. 3.1 -- 'N'
        (
                ~df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
                &
                ~(df['new_hash'].ne(df['old_hash']))
        ),
        'N',
        np.where(
            # When new hash code is available and old hash code is not available. 1 -- Y
            (
                    ~df['new_hash'].isna()
                    &
                    df['old_hash'].isna()
            ),
            'Y',
            np.where(
                # When hash codes are available and do not match. 3.2 -- 'Y'
                (
                        ~df['new_hash'].isna()
                        &
                        ~df['old_hash'].isna()
                        &
                        df['new_hash'].ne(df['old_hash'])
                ),
                'Y',
                'N'
            )
        )
    )
)

But when I add multiple columns at once, I get an error (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)).

What is wrong with adding multiple columns? Can anyone help me?

1 answer:

Answer 0: (score: 1)

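In np.where(cond, A, B), Python evaluates each of cond, A, and B, and then passes them to the where function. where then broadcasts the inputs against each other and performs the element-wise selection. You appear to have 3 nested where calls. I'm guessing the error occurs in the innermost one, since it is evaluated first (I wouldn't have to guess if you had provided the error traceback).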
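As a minimal sketch of that broadcasting behaviour (the arrays cond, a, and b here are made-up illustration values, not data from the question): scalars stretch to match the condition, and arrays of the same length are picked from element by element.

```python
import numpy as np

# A (4,) boolean condition, standing in for a row mask.
cond = np.array([True, False, True, False])

# Scalars broadcast against the (4,) condition.
flags = np.where(cond, 'Y', 'N')

# Arrays with the same (4,) shape are selected element by element.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
picked = np.where(cond, a, b)

print(flags.tolist())   # ['Y', 'N', 'Y', 'N']
print(picked.tolist())  # [1, 20, 3, 40]
```

In both calls every argument is either a scalar or has the same length as cond, which is why they broadcast cleanly.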

                    np.where(
                        # When hash codes are available and do not match. 3.2 -- 'Y'
                        (
                                ~df['new_hash'].isna()
                                &
                                ~df['old_hash'].isna()
                                &
                                df['new_hash'].ne(df['old_hash'])
                        ),
                        ['Y', df['cr_date'], current_time],
                        ['N', df['cr_date'], df['up_date']]
                    )

The cond part is the first parenthesized expression, the logical AND of the three tests.

A is the first 3-element list, and B is the list that follows it.

Assuming the frame has 66 rows, cond will have shape (66,).

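np.array(['Y', df['cr_date'], current_time]) is probably a (3,)-shaped object-dtype array, since the inputs are a string, a Series, and a time value (here already formatted as a string by strftime).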

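That accounts for the 3 shapes in the error message: shapes (66,) (3,) (3,). A (66,) array and a (3,) array cannot be broadcast against each other, because the lengths differ and neither is 1.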
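The shape mismatch can be reproduced in isolation. This is only a sketch: the (3,) object arrays below are hypothetical stand-ins for the 3-item lists in the question, built explicitly so the example does not depend on how a given numpy version converts a list that mixes strings and a Series.

```python
import numpy as np

# Shape (66,), like the boolean row mask built from the hash columns.
cond = np.zeros(66, dtype=bool)

# Shape (3,), stand-ins for the 3-item "value if true / value if false" lists.
a = np.array(['Y', 'cr_date placeholder', 'now'], dtype=object)
b = np.array(['N', 'cr_date placeholder', 'up_date placeholder'], dtype=object)

try:
    np.where(cond, a, b)
    message = None
except ValueError as exc:
    # numpy refuses to broadcast a (66,) array against (3,) arrays.
    message = str(exc)

print(cond.shape, a.shape, b.shape)
print(message)
```

The point is that np.where treats each 3-item list as one array argument, not as three separate per-column instructions.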

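If you set just one column at a time, the expressions would be np.where(cond, 'Y', 'N') or np.where(cond, Series1, Series2). In those calls every argument is either a scalar or has the same length as cond, so they broadcast without error.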
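One way to apply this to all three columns is to compute the masks once and call np.where once per column. The sketch below is an assumption about the intended logic, read off the nested where in the question; the sample rows, the fixed current_time value, and the helper names has_new, has_old, is_new, and is_modified are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    'old_hash': ['a', 'b', None, 'd'],
    'new_hash': ['a', 'x', 'c', None],
    'cr_date': ['2019-01-01 00:00:00'] * 4,
    'up_date': ['2019-06-01 00:00:00'] * 4,
})
current_time = '2019-12-28 00:35:33'  # fixed here so the result is reproducible

# Build each boolean mask once; every mask has one entry per row.
has_new = df['new_hash'].notna()
has_old = df['old_hash'].notna()
is_new = has_new & ~has_old                                          # case 1
is_modified = has_new & has_old & df['new_hash'].ne(df['old_hash'])  # case 3.2

# One np.where per column: scalar or Series arguments broadcast per row.
df['is_changed'] = np.where(is_new | is_modified, 'Y', 'N')
df['cr_date_new'] = np.where(is_new, current_time, df['cr_date'])
df['up_date_new'] = np.where(is_new | is_modified, current_time, df['up_date'])

print(df[['is_changed', 'cr_date_new', 'up_date_new']])
```

Rows that fall into neither case keep 'N' and their existing dates, which matches the final fallback list in the question's code.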

If you don't understand what broadcasting means (or what the error is telling you), you may need to read more about numpy, which underlies pandas.