DataFrame operations with NaN: dtype has no effect, vectorize issues

Time: 2019-12-10 17:27:05

Tags: python pandas numpy

I often have dataframes with missing IDs, like this:

     ID Price
0  1000   900
1  1001   100
2  1002   150
3   NaN   600

I want to apply some logic to the ID to decide whether a record is special, and get output like this:

     ID Price  Special ID?
0  1000   900        False
1  1001   100        False
2  1002   150         True
3   nan   600        False

I usually

  1. Try to read the dataframe in as strings*

  2. Use numpy vectorize

  3. Apply the function

However, I keep running into unexpected behavior.

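  1. I specify dtype=str when reading in the data. <- should be enough

  2. I still get an error from vectorize showing that the input is being read as a float (the TypeError in the code below).

  3. I have to convert the column again with astype(str). <- an unwanted extra step? **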

I have a guess*** about what is happening, but I would like to hear from others first.

Code you can run is below:

import pandas as pd
import numpy as np

# My data comes in with some empty IDs, but these rows are still usable.
data_with_nan = {'ID':['1000','1001','1002', np.nan],
                 'Price':['900','100','150', '600']}

# I set dtype to str.
df_with_nan = pd.DataFrame(data_with_nan,dtype=str)

# My console tells me that the ID column is 'object'. I interpret this to mean the
# column only contains objects (which apparently is pandas' shorthand for string). 
#Seems to have worked.
df_with_nan.dtypes

def special_id(id):
    """Identify IDs that have 2 in them"""
    # I assume that using the dtype of str converts np.nan to 'NaN'.
    if '2' in id:
        return True
    else:
        return False

df_with_nan['Special IDs'] = np.vectorize(special_id)(df_with_nan['ID'])
# However, this assumption was incorrect:
# TypeError: argument of type 'float' is not iterable

# Maybe I can use an if condition to check if the argument is none?
def special_id_with_check(id):
    """Identify IDs that have 2 in them"""
    if id:
        if '2' in id:
            return True

df_with_nan['Special ID?'] = np.vectorize(special_id_with_check)(df_with_nan['ID'])
# This continues to return the same error:
# TypeError: argument of type 'float' is not iterable

# Therefore, I must explicitly cast this column as string (even though specifying dtype
# should have done this for me?)
df_with_nan['ID'] = df_with_nan['ID'].astype(str)

df_with_nan
df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'])
# Now it works.

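*My understanding is that nan comes in as a float, so importing the dataframe as floats would keep the NaNs problematic. I expected that when I read a dataframe in as strings, nan would become 'NaN'.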
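To check this understanding directly, one can inspect the Python type of each element. A minimal sketch, assuming the ID column is held as an object column (built here with dtype=object so the result does not depend on how a given pandas version handles dtype=str); the names ids, element_types and missing_mask are only for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the ID column; dtype=object keeps the values as given.
ids = pd.Series(['1000', '1001', '1002', np.nan], dtype=object)

# Collect the Python type name of every element.
element_types = [type(value).__name__ for value in ids]
print(element_types)

# The missing entry is a float, so a substring test such as `'2' in value` fails on it.
missing_mask = ids.isna()
print(missing_mask.tolist())
```

An object column can hold a mix of Python types, which is why the column reporting 'object' does not guarantee that every value is a string.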

**You may ask: "Why not check whether the input is null?" I did, but I somehow still get the same TypeError when using vectorize.
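One likely reason the check did not help: `if id:` tests truthiness, and NaN is a non-zero float, so it is truthy and the `in` test still runs. A minimal sketch of an explicit missing-value check, assuming missing IDs should count as not special; special_id_safe, ids and flags are hypothetical names:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the ID column.
ids = pd.Series(['1000', '1001', '1002', np.nan], dtype=object)

# NaN is a non-zero float, so a plain `if id:` does not filter it out.
nan_is_truthy = bool(np.nan)

def special_id_safe(value):
    """Return True for IDs containing '2'; treat missing IDs as not special."""
    if pd.isna(value):
        return False
    return '2' in value

# otypes=[bool] fixes the output dtype instead of letting vectorize infer it.
flags = np.vectorize(special_id_safe, otypes=[bool])(ids.to_numpy())
print(flags.tolist())
```

Returning a boolean on every path also matters here, since a function that sometimes returns None gives vectorize an inconsistent output type.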

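***My guess is that dtype only converts the non-null values. In that case, what I really should do is leave dtype alone and convert to string at the last minute, in the function call, like this: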

df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'].astype(str))

This approach feels odd to me. I would rather sort out all the types up front.
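If the goal is to settle types up front, one option is to replace missing IDs with an explicit placeholder once, right after loading. A minimal sketch, assuming an empty string is an acceptable stand-in for a missing ID (that choice is an assumption, not something stated in the question); df is a hypothetical name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': pd.Series(['1000', '1001', '1002', np.nan], dtype=object),
    'Price': ['900', '100', '150', '600'],
})

# Normalize once: after this, every ID is a real str.
df['ID'] = df['ID'].fillna('')

def special_id(value):
    """Identify IDs that have 2 in them."""
    return '2' in value

df['Special ID?'] = np.vectorize(special_id, otypes=[bool])(df['ID'].to_numpy())
print(df)
```

Unlike astype(str), which turns the missing value into the text 'nan', an explicit placeholder keeps the choice of what "missing" looks like visible in the code.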

1 answer:

Answer 0: (score: 0)

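np.nan is of type float, so the "ID" column contains both floats and strings. As mentioned in the first comment, you should try to avoid vectorize; you can simply do

df_with_nan['Special ID?'] = pd.isnull(df_with_nan['ID'])

No conversion is needed. Note that this flags the rows whose ID is missing.

You can also check the actual type of a value by running type(df_with_nan['ID'][row_idx]), where row_idx is an integer row index; it will be float for the missing (NaN) ID and str for the IDs that are present.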
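For the question's original "contains a 2" rule, the pandas string accessor offers a vectorized route that skips np.vectorize. A minimal sketch, assuming missing IDs should count as not special; `na=False` is what supplies that choice:

```python
import numpy as np
import pandas as pd

df_with_nan = pd.DataFrame({
    'ID': pd.Series(['1000', '1001', '1002', np.nan], dtype=object),
    'Price': ['900', '100', '150', '600'],
})

# regex=False treats '2' as a literal substring; na=False maps missing IDs to False.
df_with_nan['Special ID?'] = df_with_nan['ID'].str.contains('2', regex=False, na=False)
print(df_with_nan)
```

Without `na=False`, the missing row would come back as a missing value instead of False, so the resulting column would not be purely boolean.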