Question

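I have a huge DataFrame. I have tried to build a multi-index DataFrame here that resembles it. I need to get the number of NaNs for each index and column.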

import numpy as np
import pandas as pd

temp = pd.DataFrame({'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
                   'industry': ['A', 'B', 'B', 'A', 'B'],
                    'price': [np.nan, 5, 6, 11, np.nan],
                    'shares':[100, 60, np.nan, 100, 62],
                    'dates': pd.to_datetime(['1990-01-01', '1990-01-01','1990-04-01', 
                                                 '1990-04-01', '1990-08-01'])
                    })

temp.set_index(['tic', 'dates'], inplace=True)

which produces:

                industry  price  shares
tic  dates                             
IBM  1990-01-01        A    NaN   100.0
AAPL 1990-01-01        B    5.0    60.0
     1990-04-01        B    6.0     NaN
IBM  1990-04-01        A   11.0   100.0
AAPL 1990-08-01        B    NaN    62.0

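Here are the questions:

1) Minor question: why does the indexing not work? I expected to see a single IBM and a single AAPL in the tic column.

2) How can I get the ratio of NaNs to total data points for each tic, on each column? So I need a DataFrame like this: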

tic                                     IBM                AAPL
number of total NaNs                    1                  2
percentage of NaNs in 'price' column    50% (1 out of 2)   33.3% (1 out of 3)
percentage of NaNs in 'shares' column   0% (0 out of 2)    33.3% (1 out of 3)

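3) How can I rank the tics by their ratio of NaNs in the price column?

4) How can I select the top n tics with the lowest ratio of NaNs on both columns?

5) How can I do the above between two dates?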

Answer 1

1) Why does the indexing not work?

temp.sort_index()
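
Some context on why sorting helps: the printed index only blanks out a repeated outer label when equal labels sit on adjacent rows, so an unsorted index shows IBM and AAPL more than once. A minimal sketch that rebuilds the sample frame from the question (the name sorted_temp is only illustrative):

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# sort_index returns a new frame; temp itself is left untouched
sorted_temp = temp.sort_index()

# After sorting, the rows of each tic are contiguous, so the printed
# index shows each tic label once.
print(sorted_temp)
print(sorted_temp.index.is_monotonic_increasing)
```

Note that sort_index() does not modify temp unless the result is assigned back or inplace=True is passed.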

2) How can I get the ratio of NaNs?

grpd = temp.groupby(level='tic').agg(['size', 'count'])

# count excludes NaN and size includes it, so 1 - count/size is the NaN ratio
null_ratio = 1 - grpd.xs('count', axis=1, level=1) \
        .div(grpd.xs('size', axis=1, level=1))

null_ratio
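
An equivalent route to the same ratios, offered as an alternative sketch rather than as what the code above does: isna() yields a boolean frame, and the mean of booleans within each tic is the fraction of NaNs. The name null_ratio_alt is illustrative, and only the price and shares columns are included here.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# True where a value is missing; the per-group mean of booleans
# is the missing fraction for that tic and column.
null_ratio_alt = temp[['price', 'shares']].isna().groupby(level='tic').mean()
print(null_ratio_alt)
```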

3) Rank by nulls in the price column?

null_ratio.price.rank()

tic
AAPL    1.0
IBM     2.0
Name: price, dtype: float64

4) How can I select the top n with the lowest ratio of NaNs on both columns?

null_ratio.price.nsmallest(1)

tic
AAPL    0.333333
Name: price, dtype: float64
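
The snippet above looks at price only. If "both columns" means a combined ratio over price and shares, one possible reading (an assumption, since the question does not define it) is to average the two per-column ratios and take the n smallest. The names ratios, combined, n, and top_n are illustrative.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# Per-tic NaN fraction for each of the two columns
ratios = temp[['price', 'shares']].isna().groupby(level='tic').mean()

# Assumed definition of "both columns": the average of the two ratios
combined = ratios.mean(axis=1)

n = 1  # how many tics to keep
top_n = combined.nsmallest(n)
print(top_n)
```

Other definitions, such as requiring a low ratio in each column separately, would need a different selection step.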

5) Between dates

temp.sort_index().loc[pd.IndexSlice[:, '1990-01-01':'1990-04-01'], :]
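
To repeat the NaN ratios for a date window, one option is to slice first and aggregate afterwards. This is a sketch that assumes the window should include both end dates, which is how label slicing with .loc behaves; start, end, window, and window_ratio are illustrative names.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

start, end = pd.Timestamp('1990-01-01'), pd.Timestamp('1990-04-01')

# Slicing on the second index level needs a sorted index, hence sort_index()
window = temp.sort_index().loc[pd.IndexSlice[:, start:end], :]

# Same NaN-fraction computation, restricted to the selected rows
window_ratio = window[['price', 'shares']].isna().groupby(level='tic').mean()
print(window_ratio)
```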

Answer 2

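You can sort the index by level to get the order you wanted. Older pandas releases exposed this as sortlevel; current releases use sort_index with the level argument:

temp.sort_index(level='tic', inplace=True)
temp.sort_index(level=['tic', 'dates'], inplace=True)

# group by tic once and reuse the grouping for every summary column
temp_grpd = temp.groupby(level='tic')
df = pd.DataFrame({
    'total_missing': temp_grpd.apply(lambda x: x['price'].isnull().sum() + x['shares'].isnull().sum()),
    'pnt_missing_price': temp_grpd.apply(lambda x: x['price'].isnull().sum() / x.shape[0]),
    'pnt_missing_shares': temp_grpd.apply(lambda x: x['shares'].isnull().sum() / x.shape[0]),
    'total_records': temp_grpd.apply(lambda x: x.shape[0]),
})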

You can transpose the DataFrame if you need to match the format in your post, but it will probably be easier to work with in this format.
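
The transposition mentioned here could look like the following. The sketch builds an equivalent per-tic summary with vectorized group operations (a swapped-in variant, not the apply-based code of this answer) and then flips it with .T; missing and wide are illustrative names.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    'tic': ['IBM', 'AAPL', 'AAPL', 'IBM', 'AAPL'],
    'industry': ['A', 'B', 'B', 'A', 'B'],
    'price': [np.nan, 5, 6, 11, np.nan],
    'shares': [100, 60, np.nan, 100, 62],
    'dates': pd.to_datetime(['1990-01-01', '1990-01-01', '1990-04-01',
                             '1990-04-01', '1990-08-01']),
}).set_index(['tic', 'dates'])

# Boolean "is missing" frame, grouped by tic
missing = temp[['price', 'shares']].isna().groupby(level='tic')

df = pd.DataFrame({
    'total_missing': missing.sum().sum(axis=1),      # NaNs over both columns
    'pnt_missing_price': missing['price'].mean(),    # NaN fraction in price
    'pnt_missing_shares': missing['shares'].mean(),  # NaN fraction in shares
    'total_records': missing.size(),                 # rows per tic
})

# One column per tic, one row per statistic, as in the question's layout
wide = df.T
print(wide)
```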

df['pnt_missing_price'].rank(ascending=False)

The question is not clearly defined. I think you may want the following, but it is unclear.

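df['pnt_missing'] = df['total_missing']/df['total_records']
df.sort_values('pnt_missing', ascending=True)
# nsmallest returns values indexed by tic, so select rows through its index
df.loc[df['pnt_missing'].nsmallest(5).index]

You have already gotten a good answer from piRSquared.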

Questions about working with a multi-index DataFrame