Question

I'm trying to add multiple columns to a DataFrame using numpy.where() in my ETL logic.

Here is my df:

I'm trying to get my df to look like this:

The code is:

current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [
                np.where(
                    # When old hash code is available and new hash code is not available. 0 -- N
                    (
                            df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                    ) |
                    # When hash codes are available and matched. 3.1 -- 'N'
                    (
                            ~df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                            &
                            ~(df['new_hash'].ne(df['old_hash']))
                    ),
                    ['N', df['cr_date'], df['up_date']],
                    np.where(
                        # When new hash code is available and old hash code is not available. 1 -- Y
                        (
                                ~df['new_hash'].isna()
                                &
                                df['old_hash'].isna()
                        ),
                        ['Y', current_time, current_time],
                        np.where(
                            # When hash codes are available and do not match. 3.2 -- 'Y'
                            (
                                    ~df['new_hash'].isna()
                                    &
                                    ~df['old_hash'].isna()
                                    &
                                    df['new_hash'].ne(df['old_hash'])
                            ),
                            ['Y', df['cr_date'], current_time],
                            ['N', df['cr_date'], df['up_date']]
                        )
                    )
                )
            ],
            index=df.index,
            columns=['is_changed', 'cr_date_new', 'up_date_new']
        )
    ],
    axis=1
)

I also tried the code above with df.join() instead of pd.concat(). It still raises the ValueError quoted below.

I can add one column at a time. For example:

df['is_changed'] = (
    np.where(
        # When old hash code is available and new hash code is not available. 0 -- N
        (
                df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
        ) |
        # When hash codes are available and matched. 3.1 -- 'N'
        (
                ~df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
                &
                ~(df['new_hash'].ne(df['old_hash']))
        ),
        'N',
        np.where(
            # When new hash code is available and old hash code is not available. 1 -- Y
            (
                    ~df['new_hash'].isna()
                    &
                    df['old_hash'].isna()
            ),
            'Y',
            np.where(
                # When hash codes are available and do not match. 3.2 -- 'Y'
                (
                        ~df['new_hash'].isna()
                        &
                        ~df['old_hash'].isna()
                        &
                        df['new_hash'].ne(df['old_hash'])
                ),
                'Y',
                'N'
            )
        )
    )
)

But adding multiple columns at once fails (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)).

What is wrong with adding multiple columns? Can someone help me?

Answer 1

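In np.where(cond, A, B), Python evaluates each of cond, A, and B and only then passes them to the where function. where then broadcasts the inputs against each other and performs an element-wise selection. You appear to have three nested where calls. I'm guessing the error occurs in the innermost one, since it is evaluated first (I wouldn't have to guess if you had provided the error traceback).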

                    np.where(
                        # When hash codes are available and do not match. 3.2 -- 'Y'
                        (
                                ~df['new_hash'].isna()
                                &
                                ~df['old_hash'].isna()
                                &
                                df['new_hash'].ne(df['old_hash'])
                        ),
                        ['Y', df['cr_date'], current_time],
                        ['N', df['cr_date'], df['up_date']]
                    )

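The cond part is the first () expression, the logical AND of the masks.

A is the 3-element list, and B is the next list.

Assuming there are 66 rows, cond will have shape (66,).

np.array(['Y', df['cr_date'], current_time]) is probably a (3,)-shaped object-dtype array, since the inputs are a string, a Series, and a time value.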

That accounts for the three shapes in the error message: shapes (66,) (3,) (3,).
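The shape mismatch can be reproduced without pandas. This is a minimal sketch, assuming a 6-element mask standing in for the 66-row condition and two made-up 3-element arrays standing in for the lists; none of these values come from the question's data.

```python
import numpy as np

# Stand-in for the (66,) boolean mask built from the hash columns.
cond = np.array([True, False, True, False, True, False])  # shape (6,)

# Stand-ins for the two 3-element lists passed as A and B.
a = np.array(['Y', 'a1', 'a2'], dtype=object)  # shape (3,)
b = np.array(['N', 'b1', 'b2'], dtype=object)  # shape (3,)

message = None
try:
    # (6,) cannot be broadcast against (3,): the trailing dimensions differ
    # and neither of them is 1.
    np.where(cond, a, b)
except ValueError as exc:
    message = str(exc)

print(message)
```

The printed message names the same kind of shape triple as the one in the question, with 6 in place of 66.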

If you set only one column at a time, the expression would be np.where(cond, 'Y', 'N') or np.where(cond, Series1, Series2).
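Applied to the three target columns, the one-column-at-a-time approach might look like the sketch below. The sample rows, the fixed current_time string, and the helper names has_new, has_old, is_new, and is_modified are assumptions made for illustration; only the column names and the Y/N rules come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering the cases described in the question.
df = pd.DataFrame({
    'new_hash': ['a', None, 'c', 'd'],
    'old_hash': [None, 'b', 'c', 'x'],
    'cr_date': [None, '2020-01-01 00:00:00', '2020-01-02 00:00:00', '2020-01-03 00:00:00'],
    'up_date': [None, '2020-02-01 00:00:00', '2020-02-02 00:00:00', '2020-02-03 00:00:00'],
})
current_time = '2024-01-01 00:00:00'  # fixed value so the result is reproducible

# Name each mask once; every mask has shape (len(df),).
has_new = df['new_hash'].notna()
has_old = df['old_hash'].notna()
is_new = has_new & ~has_old                                          # case 1
is_modified = has_new & has_old & df['new_hash'].ne(df['old_hash'])  # case 3.2

# One np.where per column: the mask and both choices broadcast cleanly.
df['is_changed'] = np.where(is_new | is_modified, 'Y', 'N')
df['cr_date_new'] = np.where(is_new, current_time, df['cr_date'])
df['up_date_new'] = np.where(is_new | is_modified, current_time, df['up_date'])

print(df[['is_changed', 'cr_date_new', 'up_date_new']])
```

Because every remaining case falls through to 'N' with the existing dates, the nested where calls collapse into a single call per column.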

If you don't understand what broadcasting means (or the error), you may need to learn more about numpy, which underlies pandas.
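As a small illustration of broadcasting, a mask of shape (n,) can be combined with 3-element choices once it is reshaped to (n, 1). This is only a sketch with made-up values; it shows the shape rule and is not a drop-in replacement for the question's code, where some choices are whole Series rather than scalars.

```python
import numpy as np

cond = np.array([True, False, True, False])  # shape (4,)
a = np.array(['Y', 'a1', 'a2'])              # shape (3,)
b = np.array(['N', 'b1', 'b2'])              # shape (3,)

# cond[:, None] has shape (4, 1); against shape (3,) it broadcasts to (4, 3),
# so each row picks the whole of a or the whole of b.
out = np.where(cond[:, None], a, b)

print(out.shape)
print(out)
```

Each True row of the result is a copy of a, and each False row is a copy of b.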
