Confusing behaviour of the Pandas crosstab() function with a dataframe containing NaN values

Time: 2015-10-23 13:18:47

Tags: python pandas dataframe nan crosstab

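I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists the species and gender of individual animals. It's a real-world dataset, so it inevitably has missing values, represented by NaN. A simplified version of the data can be generated with: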

import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                        'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
                        'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})

Printing the dataframe gives:

    gender  id species
0     male   1     dog
1   female   2     dog
2   female   3     NaN
3     male   4     dog
4     male   5     dog
5   female   6     cat
6   female   7     cat
7      NaN   8     cat
8     male   9     dog
9     male  10     cat
10  female  11     cat
11    male  12     dog
12  female  13     dog
13  female  14     dog
14    male  15     dog
15  female  16     NaN
16    male  17     cat
17  female  18     cat
18     NaN  19     dog
19    male  20     dog

I want to generate a cross-tabulation showing the number of males and females in each species, using the following:

pd.crosstab(tempDF['species'],tempDF['gender'])

This produces the following table:

gender   female  male
species              
cat           4     2
dog           3     7

This is what I expected. However, if I include the margins=True option, it produces:

pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)

gender   female  male  All
species                   
cat           4     2    7
dog           3     7   11
All           9     9   20

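As you can see, the marginal totals don't match the cells: the cat row shows 4 + 2 but an All of 7, and the female column shows 4 + 3 but an All of 9. Presumably this is caused by the missing data in the dataframe. Is this the intended behaviour? It seems confusing to me. Surely the marginal totals should be the sums of the rows and columns shown in the table, and should not include any missing data that isn't represented in the table. Including dropna=False does not affect the result.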
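To make the expected totals concrete, here is a minimal sketch that builds the table without margins and then adds the totals by hand, so that every total is the sum of exactly the cells shown. The variable name `table` is only illustrative, and this is one possible way of getting consistent margins, not something crosstab() does for you.

```python
import numpy as np
import pandas as pd

# Same simplified data as in the question.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

# Table without margins: rows with NaN in either column are left out.
table = pd.crosstab(tempDF['species'], tempDF['gender'])

# Add the totals manually so they only count the cells that are displayed.
table['All'] = table.sum(axis=1)      # row totals
table.loc['All'] = table.sum(axis=0)  # column totals, including the new 'All' column
print(table)
```

With this approach the cat row totals 6, the dog row totals 10, and the grand total is 16, i.e. the 20 animals minus the 4 rows with a missing species or gender.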

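I could drop any rows containing NaN before creating the table, but that seems like a lot of extra work, and a lot of extra things to keep in mind while doing an analysis. Should I report this as a bug?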
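For reference, the "drop the NaN rows first" option could look roughly like the sketch below. It assumes only the two columns used in the table should be checked for missing values; `cleanDF` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same simplified data as in the question.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

# Keep only the rows where both species and gender are known.
cleanDF = tempDF.dropna(subset=['species', 'gender'])

# With no NaN left, the margins can only count rows that appear in the table.
clean_table = pd.crosstab(cleanDF['species'], cleanDF['gender'], margins=True)
print(clean_table)
```

Restricting dropna() to `subset=['species', 'gender']` avoids discarding rows that are only missing values in unrelated columns.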

2 Answers:

Answer 0: (score: 3)

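I suppose one workaround would be to convert the NaNs to 'missing' before creating the table. The crosstab will then include a column and a row dedicated to the missing values: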

pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)

gender   female  male  missing  All
species                            
cat           4     2        1    7
dog           3     7        1   11
missing       2     0        0    2
All           9     9        2   20

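Personally, I would like to see this as the default behaviour, so that I don't have to remember to replace all the NaNs in every crosstab calculation.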
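Until then, one way to avoid repeating the fillna() calls is a small wrapper. The sketch below is only an illustration: `crosstab_with_missing` and its `label` parameter are hypothetical names, not part of pandas, and it assumes both inputs are Series whose values never already equal the label.

```python
import numpy as np
import pandas as pd

def crosstab_with_missing(rows, cols, label='missing', **kwargs):
    """Hypothetical helper: crosstab that counts NaN under an explicit label."""
    # Replace NaN with the label so the missing values get their own row/column.
    return pd.crosstab(rows.fillna(label), cols.fillna(label), **kwargs)

# Same simplified data as in the question.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

result = crosstab_with_missing(tempDF['species'], tempDF['gender'], margins=True)
print(result)
```

Extra keyword arguments such as margins=True are passed straight through to pd.crosstab, so the helper can be used wherever the plain call was used before.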

Answer 1: (score: 2)

You're not the only one who has run into this. It happens not only with pd.crosstab, but also with pd.pivot_table and DataFrame.groupby.

The documentation says this about groupby excluding NA:

  

"NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example."
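A short sketch of that exclusion with the question's data is shown below. Note the assumption: the `dropna=False` argument to groupby was added in pandas 1.1, so it is not available in the pandas 0.17.0 used in the question. The names `default_counts` and `all_counts` are illustrative.

```python
import numpy as np
import pandas as pd

# Same simplified data as in the question.
tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"],
})

# Default behaviour: any row with NaN in a grouping key is silently excluded.
default_counts = tempDF.groupby(['species', 'gender']).size()

# Assumes pandas >= 1.1: keep NaN as its own group key instead of dropping it.
all_counts = tempDF.groupby(['species', 'gender'], dropna=False).size()

print(default_counts.sum())  # rows counted when NaN keys are excluded
print(all_counts.sum())      # rows counted when NaN keys are kept
```

Comparing the two sums is a quick way to check how many rows a grouped summary has dropped because of missing keys.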

You can find some good solutions in this post: groupby columns with NaN (missing) values

Maybe someday someone will fix this: https://github.com/pandas-dev/pandas/issues/10772