Question

I often have data frames with missing IDs, like this:

     ID Price
0  1000   900
1  1001   100
2  1002   150
3   NaN   600

I want to apply some logic to the ID to decide whether a record is special, producing output like this:

     ID Price  Special ID?
0  1000   900        False
1  1001   100        False
2  1002   150         True
3   nan   600        False

I usually:

try to read the data frame in as strings*
use numpy vectorize

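However, I keep running into unexpected behavior.

I specify dtype=str when I load the data. <- should be enough
I still get a TypeError, which indicates that vectorize is receiving the input as a float.
I have to convert the column again with astype(str). <- unnecessary extra step? **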

I have a guess about what is going on***, but I would like to hear from others first.

Code you can run is below:

import pandas as pd
import numpy as np

# My data comes in with some empty IDs, but these rows are still usable.
data_with_nan = {'ID':['1000','1001','1002', np.nan],
                 'Price':['900','100','150', '600']}

# I set dtype to str.
df_with_nan = pd.DataFrame(data_with_nan,dtype=str)

# My console tells me that the ID column is 'object'. I interpret this to mean the
# column only contains objects (which apparently is pandas' shorthand for string). 
#Seems to have worked.
df_with_nan.dtypes

def special_id(id):
    """Identify IDs that have 2 in them"""
    # I assume that using the dtype of str converts np.nan to 'NaN'.
    if '2' in id:
        return True
    else:
        return False

df_with_nan['Special IDs'] = np.vectorize(special_id)(df_with_nan['ID'])
# However, this assumption was incorrect:
# TypeError: argument of type 'float' is not iterable

# Maybe I can use an if condition to check if the argument is none?
def special_id_with_check(id):
    """Identify IDs that have 2 in them"""
    if id:
        if '2' in id:
            return True

df_with_nan['Special ID?'] = np.vectorize(special_id_with_check)(df_with_nan['ID'])
# This continues to return the same error:
# TypeError: argument of type 'float' is not iterable

# Therefore, I must explicitly cast this column as string (even though specifying dtype
# should have done this for me?)
df_with_nan['ID'] = df_with_nan['ID'].astype(str)

df_with_nan
df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'])
# Now it works.

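*My understanding is that nan arrives as a float, so importing the data frame as floats would keep the nans problematic. I expected that when I read the data frame in as strings, nan would become 'NaN'.

**You may ask: "Why not check whether the input is null?" I did, but somehow I still get a TypeError when using vectorize.

***My guess is that dtype only converts the non-null values. In that case, what I should really do is leave dtype alone and convert to string at the last minute, in the function call, like this: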

df_with_nan['Special ID?'] = np.vectorize(special_id)(df_with_nan['ID'].astype(str))

This approach feels odd to me. I would rather sort out all the types up front.

Answer 1

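np.nan is of type float, so the 'ID' column contains both floats and strings. As mentioned in the first comment, you should try to avoid vectorize; you can simply do

df_with_nan['Special ID?'] = pd.isnull(df_with_nan['ID'])

No conversion is needed.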
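If the goal is the output shown in the question (flag IDs that contain '2' and treat a missing ID as not special), one possible sketch uses the pandas string accessor instead of np.vectorize. This is an assumption about the intended result, not part of the original answer; the column names are taken from the question.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question
df_with_nan = pd.DataFrame({'ID': ['1000', '1001', '1002', np.nan],
                            'Price': ['900', '100', '150', '600']})

# Series.str.contains works element-wise on the strings and skips missing values;
# na=False fills the result for missing IDs with False,
# regex=False treats '2' as a literal substring.
df_with_nan['Special ID?'] = df_with_nan['ID'].str.contains('2', na=False, regex=False)

print(df_with_nan)
```

No astype(str) call is involved here, so the missing ID stays a real missing value rather than becoming the string 'nan'.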

Note that you can check the actual type of a value by running type(df_with_nan['ID'][row_idx]), where row_idx is the integer row index; it will be str for what you called non-special IDs and float for the special ID.
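The type check can be run over the whole column at once. The sketch below is illustrative only (the names type_names, nan_is_truthy and missing_mask are made up for this example); it also shows why the `if id:` guard in the question does not help: NaN is truthy.

```python
import numpy as np
import pandas as pd

# Same construction as in the question, ID column only
df_with_nan = pd.DataFrame({'ID': ['1000', '1001', '1002', np.nan]}, dtype=str)

# Type of every element in the column: the missing ID is still a float
type_names = [type(value).__name__ for value in df_with_nan['ID']]
print(type_names)

# NaN is a non-zero float, so a bare `if id:` check lets it through
nan_is_truthy = bool(np.nan)
print(nan_is_truthy)

# pd.isna marks the missing entries explicitly
missing_mask = pd.isna(df_with_nan['ID']).tolist()
print(missing_mask)
```

A null check inside a function passed to np.vectorize would therefore need pd.isna (or similar) rather than relying on truthiness.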

Dataframe operations with nan: dtype does not work, problems with vectorize
