I often have data frames with missing IDs, like this:
     ID Price
0  1000   900
1  1001   100
2  1002   150
3   NaN   600
I want to apply some logic to the ID to determine whether a record is special, to get output like this:
     ID Price  Special ID?
0  1000   900        False
1  1001   100        False
2  1002   150         True
3   nan   600        False
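I usually try to read the data frame in as strings* and then use numpy vectorize.
However, I keep running into unexpected behavior.
I specify dtype=str when reading in the data. <- should be enough
I still get a TypeError from vectorize indicating that the input is being read as a float.
I have to cast the column again with astype(str). <- an extra step that should not be needed? **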
I have a guess about what is going on***, but I would like to hear from others first.
Code you can run is below:
import pandas as pd
import numpy as np

# My data comes in with some empty IDs, but these rows are still usable.
data_with_nan = {'ID': ['1000', '1001', '1002', np.nan],
                 'Price': ['900', '100', '150', '600']}

# I set dtype to str.
df_with_nan = pd.DataFrame(data_with_nan, dtype=str)

# My console tells me that the ID column is 'object'. I interpret this to mean the
# column only contains objects (which apparently is pandas' shorthand for string).
# Seems to have worked.
df_with_nan.dtypes

def special_id(id):
    """Identify IDs that have 2 in them"""
    # I assume that using the dtype of str converts np.nan to 'NaN'.
    if '2' in id:
        return True
    else:
        return False

df_with_nan['Special IDs'] = np.vectorize(special_id)(df_with_nan['ID'])
# However, this assumption was incorrect:
# TypeError: argument of type 'float' is not iterable

# Maybe I can use an if condition to check if the argument is none?
def special_id_with_check(id):
    """Identify IDs that have 2 in them"""
    if id:
        if '2' in id:
            return True

df_with_nan['Special ID?'] = np.vectorize(special_id_with_check)(df_with_nan['ID'])
# This continues to return the same error:
# TypeError: argument of type 'float' is not iterable

# Therefore, I must explicitly cast this column as string (even though specifying dtype
# should have done this for me?)
df_with_nan['ID'] = df_with_nan['ID'].astype(str)
df_with_nan
df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'])
# Now it works.
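*My understanding is that nan comes in as a float, so importing the data frame as floats would leave the NaNs as a problem. I expected that when I read a data frame in as strings, nan would become 'NaN'.
**You might ask, "Why not check whether the input is null?" I did, but somehow I still get the TypeError when using vectorize.
***My guess is that dtype only converts the non-null values. In that case, what I should really do is leave dtype alone and cast to string at the last minute in the function call, like this: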
df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'].astype(str))
This approach feels odd to me. I would rather settle all the types up front.
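One possible reason the null check in the code above still fails: bool(np.nan) is True, so a guard like `if id:` lets NaN through to the `in` test. The sketch below checks the element's type instead. It assumes the ID column holds Python strings plus a float NaN (the Series is built with an explicit object dtype to make that so), and `has_two` is a hypothetical helper name, not something from the original code.

```python
import numpy as np
import pandas as pd

# Assumption for this sketch: object dtype, so the missing ID stays a float NaN.
ids = pd.Series(['1000', '1001', '1002', np.nan], dtype=object)

def has_two(value):
    """Return True only for string IDs that contain '2' (hypothetical helper)."""
    # bool(np.nan) is True, so `if value:` does not filter out NaN;
    # an explicit type check does.
    return isinstance(value, str) and '2' in value

# otypes=[bool] fixes the output type instead of inferring it from the first call.
flags = np.vectorize(has_two, otypes=[bool])(ids)
print(flags.tolist())
```

With this guard the column does not need to be cast with astype(str) before the call, although whether that is preferable to settling the types up front is a judgment call.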
Answer 0 (score: 0)
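np.nan is of type float, so the 'ID' column contains both floats and strings. As mentioned in the first comment, you should try to avoid vectorize; you can simply do
df_with_nan['Special ID?'] = pd.isnull(df_with_nan['ID'])
No conversion is needed.
Note that you can check the actual type of a value by running type(df_with_nan['ID'][row_idx]), where row_idx is the integer row index. For the values you call non-special IDs it will be str, and for the special ID it will be float.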
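As a rough illustration of the two points above (inspecting element types and avoiding vectorize), here is a minimal sketch. It assumes the ID column is object dtype with a float NaN for the missing entry, which is why the Series is built with an explicit dtype=object. Note that pd.isnull flags the missing IDs; to flag IDs containing '2', as in the question's expected output, the pandas string accessor with na=False is one option.

```python
import numpy as np
import pandas as pd

# Assumption for this sketch: explicit object dtype keeps NaN as a float.
df = pd.DataFrame({
    'ID': pd.Series(['1000', '1001', '1002', np.nan], dtype=object),
    'Price': ['900', '100', '150', '600'],
})

# Per-element types: the strings are str, the missing ID is a float NaN.
id_types = [type(value).__name__ for value in df['ID']]
print(id_types)

# Vectorized substring test without np.vectorize.
# na=False makes the missing ID evaluate to False instead of NaN.
df['Special ID?'] = df['ID'].str.contains('2', regex=False, na=False)
print(df)
```

Here regex=False treats '2' as a literal substring, and the ID column itself is left unconverted.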