I am trying to add multiple columns to a DataFrame using numpy.where() in my ETL logic.
This is my df:
I am trying to get my df as:
The code is:
    import numpy as np
    import pandas as pd

    current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')
    df = pd.concat(
        [
            df,
            pd.DataFrame(
                [
                    np.where(
                        # When old hash code is available and new hash code is not available. 0 -- N
                        (
                            df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                        ) |
                        # When hash codes are available and matched. 3.1 -- 'N'
                        (
                            ~df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                            &
                            ~(df['new_hash'].ne(df['old_hash']))
                        ),
                        ['N', df['cr_date'], df['up_date']],
                        np.where(
                            # When new hash code is available and old hash code is not available. 1 -- Y
                            (
                                ~df['new_hash'].isna()
                                &
                                df['old_hash'].isna()
                            ),
                            ['Y', current_time, current_time],
                            np.where(
                                # When hash codes are available but do not match. 3.2 -- 'Y'
                                (
                                    ~df['new_hash'].isna()
                                    &
                                    ~df['old_hash'].isna()
                                    &
                                    df['new_hash'].ne(df['old_hash'])
                                ),
                                ['Y', df['cr_date'], current_time],
                                ['N', df['cr_date'], df['up_date']]
                            )
                        )
                    )
                ],
                index=df.index,
                columns=['is_changed', 'cr_date_new', 'up_date_new']
            )
        ],
        axis=1
    )
I also tried the code above with df.join() instead of pd.concat(). It still gives me the ValueError specified below.
I can add one column at a time. For example:
    df['is_changed'] = (
        np.where(
            # When old hash code is available and new hash code is not available. 0 -- N
            (
                df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
            ) |
            # When hash codes are available and matched. 3.1 -- 'N'
            (
                ~df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
                &
                ~(df['new_hash'].ne(df['old_hash']))
            ),
            'N',
            np.where(
                # When new hash code is available and old hash code is not available. 1 -- Y
                (
                    ~df['new_hash'].isna()
                    &
                    df['old_hash'].isna()
                ),
                'Y',
                np.where(
                    # When hash codes are available but do not match. 3.2 -- 'Y'
                    (
                        ~df['new_hash'].isna()
                        &
                        ~df['old_hash'].isna()
                        &
                        df['new_hash'].ne(df['old_hash'])
                    ),
                    'Y',
                    'N'
                )
            )
        )
    )
But adding multiple columns at once raises an error (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)).
What is wrong with adding multiple columns? Can someone help me?
Answer 0 (score: 1)
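In np.where(cond, A, B), Python evaluates each of cond, A and B, and then passes them to the where function. where then broadcasts the inputs against each other and performs an element-wise selection. You appear to have three nested where calls. I am guessing the error occurs in the innermost one, since it is evaluated first (I would not have to guess if you had provided the error traceback):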
    np.where(
        # When hash codes are available but do not match. 3.2 -- 'Y'
        (
            ~df['new_hash'].isna()
            &
            ~df['old_hash'].isna()
            &
            df['new_hash'].ne(df['old_hash'])
        ),
        ['Y', df['cr_date'], current_time],
        ['N', df['cr_date'], df['up_date']]
    )
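The cond part is the first parenthesized expression, the logical-and of the three tests.
A is the 3-element list, and B is the list that follows it.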
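Assuming there are 66 rows, cond will have shape (66,).
np.array(['Y', df['cr_date'], current_time]) is probably a (3,) shape object-dtype array, since the inputs are a string, a Series and a formatted time string.
That accounts for the three shapes in the error message: shapes (66,) (3,) (3,)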
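To make the shape mismatch concrete, here is a minimal sketch that reproduces the same kind of error without the original data. The names choice_a and choice_b and their placeholder contents are assumptions made up for illustration; they only stand in for the two 3-element lists in the question.

```python
import numpy as np

# Stand-in for the (66,) boolean mask built from the hash columns.
cond = np.zeros(66, dtype=bool)

# Hypothetical placeholders for the two 3-element lists: each is a (3,) object array.
choice_a = np.array(['Y', 'cr', 'now'], dtype=object)
choice_b = np.array(['N', 'cr', 'up'], dtype=object)

try:
    # (66,) cannot be broadcast against (3,), so this raises ValueError.
    np.where(cond, choice_a, choice_b)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)

# Scalars broadcast against any shape, so the single-column form works.
single = np.where(cond, 'Y', 'N')
print(single.shape)
```

The first call fails because a length-66 axis and a length-3 axis are incompatible, while the second succeeds because the scalars 'Y' and 'N' stretch to match the mask.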
If you set only one column at a time, the expression would be np.where(cond, 'Y', 'N') or np.where(cond, Series1, Series2).
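As a rough sketch of that one-column-at-a-time approach applied to all three target columns, the following builds each column with its own np.where call. The sample rows and the mask names (has_new, has_old, is_new, is_modified) are assumptions for illustration, not the asker's real data, and the rules are my reading of the comments in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data covering the four cases in the question.
df = pd.DataFrame({
    'old_hash': ['a', 'b', None, 'd'],
    'new_hash': ['a', 'x', 'c', None],
    'cr_date': ['2020-01-01 00:00:00'] * 4,
    'up_date': ['2020-01-02 00:00:00'] * 4,
})
current_time = pd.Timestamp.now(tz='UTC').strftime('%Y-%m-%d %H:%M:%S')

# Name the masks once so each column's rule stays readable.
has_new = df['new_hash'].notna()
has_old = df['old_hash'].notna()
is_new = has_new & ~has_old                                          # case 1
is_modified = has_new & has_old & df['new_hash'].ne(df['old_hash'])  # case 3.2

# Every call sees a (4,) mask plus scalars or (4,) Series, so broadcasting works.
df['is_changed'] = np.where(is_new | is_modified, 'Y', 'N')
df['cr_date_new'] = np.where(is_new, current_time, df['cr_date'])
df['up_date_new'] = np.where(is_new | is_modified, current_time, df['up_date'])

print(df[['is_changed', 'cr_date_new', 'up_date_new']])
```

Each mask has the same length as the frame, so every argument is either a scalar or a matching-length Series, and no (3,) list is ever broadcast against the rows.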
If you do not understand what broadcasting means (or the error), you may need to learn more about numpy (which underlies pandas).
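For readers new to broadcasting, here is a small illustration of how shapes combine when the mask is turned into a column. The arrays are made-up constants (row_if_true and row_if_false are hypothetical names), so this only covers the case where each choice is a fixed row of values, not per-row Series as in the question.

```python
import numpy as np

cond = np.array([True, False, True, False])  # shape (4,)
row_if_true = np.array(['Y', 'a', 'b'])      # shape (3,)
row_if_false = np.array(['N', 'c', 'd'])     # shape (3,)

# A (4, 1) mask broadcasts with (3,) choices to a (4, 3) result.
print(np.broadcast_shapes((4, 1), (3,), (3,)))

# cond[:, None] adds the trailing axis, so each row picks a whole 3-element row.
picked = np.where(cond[:, None], row_if_true, row_if_false)
print(picked.shape)
print(picked)
```

Shapes are compared from the trailing axis backwards, and an axis of length 1 stretches to match, which is why (4, 1) works with (3,) while (4,) does not.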