Question

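While working in pandas I have run into some rather strange behaviour with missing values, and it is tripping me up.

Please observe the following: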

import pandas as pd
import numpy as np
from numpy import nan as NA
from pandas import DataFrame

In [1]: L1 = [NA, NA]
In [2]: L1
Out[2]: [nan, nan]
In [3]: set(L1)
Out[3]: {nan}

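So far so good: everything behaves as expected, and the set of list L1 contains a single NA value. But when I do the same thing with a list extracted from a DataFrame column (a Series), I am completely stumped.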

In [4]: EG = DataFrame(np.random.rand(10), columns = ['Data'])
In [5]: EG['Data'][5:7] = NA
In [6]: L2 = list(EG['Data'][5:7])
In [7]: L2
Out[7]: [nan, nan]
In [9]: set(L2)
Out[9]: {nan, nan}

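What is going on here? Why are the sets different when the lists they are built from look exactly the same?

I did some digging and wondered whether the types might differ (which would be surprising, given that the NA values were created in what looks to me like exactly the same way). See the following: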

In [10]: type(L1[0])
Out[10]: float
In [11]: type(L1[1])
Out[11]: float
In [12]: type(L2[0])
Out[12]: numpy.float64
In [13]: type(L2[1])
Out[13]: numpy.float64

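So the types clearly are different, which already surprises me. But if I convert each element of L2 to a float, just as in L1, the odd set behaviour should disappear: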

In [14]: L3 = [float(elem) for elem in L2]
In [15]: L3
Out[15]: [nan, nan]
In [16]: type(L3[0])
Out[16]: float
In [17]: type(L3[1])
Out[17]: float 
In [18]: set(L3)
Out[18]: {nan, nan}

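The problem persists even though the element types in L3 are now exactly the same as in L1.

Can anyone help?

I rely on the normal behaviour of set(L) when aggregating data with groupby. I noticed this problem and it is driving me crazy. I would be interested to hear about a workaround, but I am much more interested in understanding what is going on here...

Please help...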

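Edit: in response to a user comment, I am posting the code I am actually using to aggregate the data. I am not sure it changes the nature of the problem, but it may help explain why this matters: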

def NoActionRequired(x):
    """ This function is used to aggregate the data that is believed to be equal within multi line/day groups. It puts the data
    into a list and then if that list forms a set of length 1 (which it must if the data are in fact equal) then the single
    value contained in the set is returned, otherwise the list is returned. This allows for the fact that we may be wrong about
    the equality of the data, and it is something that can be tested after aggregation."""

    L = list(x)
    S = set(L)
    if len(S) == 1:
        return S.pop()
    else:
        return L

DFGrouped['Data'].agg(NoActionRequired)

The idea is that if all the data in a group are the same, a single value is returned; otherwise the list of data is returned.

Answer 1

The only explanation I can see right now is that all the NAs in the first list are the same object:

>>> L1 = [NA, NA]
>>> L1
[nan, nan]
>>> L1[0] is L1[1]
True

whereas the objects in the second list are distinct objects:

>>> L2 = list(pd.Series([NA, NA]))
>>> L2
[nan, nan]
>>> L2[0] is L2[1]
False
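
The identity difference above matters because, as far as I understand it, set membership in Python tests identity (`is`) before equality (`==`), and NaN never compares equal to anything, itself included. The following is only a minimal sketch using plain Python floats rather than pandas objects; the names nan_a, nan_b, len_same and len_distinct are made up for illustration:

```python
# Two separately created NaN objects (hypothetical names, for illustration only)
nan_a = float("nan")
nan_b = float("nan")

# NaN is never equal to anything, not even to itself
assert nan_a != nan_a

# The same object twice, as in L1 = [NA, NA]
same_object = [nan_a, nan_a]
# Two distinct objects, as in the list pulled out of the Series
distinct_objects = [nan_a, nan_b]

# set() treats an element as a duplicate if it `is` an existing element
# or compares equal to it; only the identity test can succeed for NaN
len_same = len(set(same_object))
len_distinct = len(set(distinct_objects))

print(len_same, len_distinct)
```

This would also explain why converting with float(elem) does not help: each conversion produces a new NaN object, so the identity test still fails.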

As for your function, I would suggest using pandas.Series.unique() instead of set, for example:

def NoActionRequired(x):
    # ...    
    S = x.unique()
    if len(S) == 1:
        return S[0]
    else:
        return list(x)

unique() seems to work well with NaN:

>>> pd.Series([NA, NA]).unique()
array([ nan])

Edit: to check whether NA is in a list, you can use the np.isnan() function:

>>> L = [NA, 1, 2]
>>> np.isnan(L)
array([ True, False, False], dtype=bool)
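
If you would rather keep the set-based logic, an explicit NaN check can be combined with it. This is just a sketch of one possible approach, not a tested replacement for the original function: is_nan and no_action_required_nan_aware are hypothetical names, and it uses the standard library's math.isnan instead of np.isnan so that non-numeric values do not raise an error.

```python
import math


def is_nan(value):
    """True only for float NaN (numpy.float64 subclasses float, so it is covered)."""
    return isinstance(value, float) and math.isnan(value)


def no_action_required_nan_aware(values):
    """Hypothetical NaN-aware variant of NoActionRequired."""
    items = list(values)
    # Swap every NaN for one shared object so that all NaNs collapse in the set
    sentinel = float("nan")
    normalised = [sentinel if is_nan(v) else v for v in items]
    distinct = set(normalised)
    if len(distinct) == 1:
        return distinct.pop()
    # The values differ, so return the original list unchanged
    return items


all_missing = no_action_required_nan_aware([float("nan"), float("nan")])
mixed = no_action_required_nan_aware([1.0, float("nan")])
constant = no_action_required_nan_aware([2.5, 2.5, 2.5])
```

Here all_missing is a single NaN, mixed comes back as the original two-element list, and constant is 2.5.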

Strange behaviour of missing values when drawn from a DataFrame / Series
