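Retrieving data via a MultiIndex

Time: 2015-10-20 18:35:31

Tags: pandas

I have a data frame with a MultiIndex. I need to process various subsets of the data by schema and/or script (the index levels are schema and script). The data frame looks like this: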

                          tx_id  step  step_id          start_time                                                     
schema_10 cmc_v2_file      19-3    10      279 2015-09-04 00:46:30   
          cmc_v2_file       2-7    10      423 2015-09-04 00:46:22   
          cmc_v2_file      29-1    10       20 2015-09-04 00:46:34   
          cmc_v2_file      35-1     4       63 2015-09-04 00:46:51   
          cmc_v2_file      31-2    10       79 2015-09-04 00:46:54   
          cmc_v2_file       5-8    10      536 2015-09-04 00:46:57   
          cmc_v2_file       5-9    10      610 2015-09-04 00:47:13   
          cmc_v2_file      39-1    10      178 2015-09-04 00:47:12   
          cmc_v2_file      41-1    10      211 2015-09-04 00:47:22   
          cmc_v2_file      21-4    10      678 2015-09-04 00:47:28   
          cmc_v2_file      23-4    10      698 2015-09-04 00:47:31   
          cmc_v2_file      31-5    10      399 2015-09-04 00:47:45   
          cmc_v2_file      35-4     3      453 2015-09-04 00:47:54   
          cmc_v2_file      29-5     4      461 2015-09-04 00:47:54   
          cmc_v2_file      29-5     8      465 2015-09-04 00:47:55   
          cmc_v2_file      42-3     1      467 2015-09-04 00:47:57   
          cmc_v2_file      22-5     8      866 2015-09-04 00:47:53   
          cmc_v2_file      16-6     8      893 2015-09-04 00:47:51   
          cmc_v2_file      17-6     4      938 2015-09-04 00:47:54   
          cmc_v2_file      17-6     8      942 2015-09-04 00:47:55   
          cmc_v2_file       6-2    10      707 2015-09-04 00:47:50   
          cmc_v2_file      4-11    10      730 2015-09-04 00:47:54   
          cmc_v2_file       6-3     2      745 2015-09-04 00:47:53   
          cmc_v2_file      5-11     1      762 2015-09-04 00:47:55   
          cmc_v2_file      4-12     1      763 2015-09-04 00:47:56   
          cmc_v2_file      5-12    10      782 2015-09-04 00:48:16   
          cmc_v2_file      31-6     4      471 2015-09-04 00:47:55   
          cmc_v2_file      38-3     4      520 2015-09-04 00:47:51   
          cmc_v2_file      39-3     4      551 2015-09-04 00:47:55   
          cmc_v2_file      31-7    10      570 2015-09-04 00:48:20   
...                         ...   ...      ...                 ...   
schema_9  hcs-vbu      1332-132    14   197542 2015-09-04 00:29:46   
          hcs-vbu       515-143     5   196309 2015-09-04 00:29:01   
          hcs-vbu       552-126    13   196333 2015-09-04 00:29:19   
          hcs-vbu       559-116    12   197068 2015-09-04 00:29:33   
          hcs-vbu       566-115    13   197201 2015-09-04 00:29:47   
          hcs-vbu       523-152     3   197443 2015-09-04 00:29:33   
          hcs-vbu       790-136     2   200774 2015-09-04 00:28:46   
          hcs-vbu       790-136     4   200776 2015-09-04 00:28:56   
          hcs-vbu       790-136    12   200784 2015-09-04 00:29:13   
          hcs-vbu       206-148     5   198213 2015-09-04 00:29:04   

To get the data for a particular script, I do this:

df.loc(axis=0)[:,[script]]

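When I print the whole data frame, it looks correct. The problem is that I am also writing unit tests for all of this, and for some of the tests I want to verify that the data contains only one script: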

scripts = df.index.levels[df.index.names.index('script')]

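However, instead of returning a list of one item as I expected, I get a list of 6, which is the number of scripts in the original, unfiltered data. Once I have filtered the data frame by calling .loc, should I retrieve the script index in a different way?

1 answer:

Answer 0: (score: 0)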

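Your second statement, df.index.levels, gets all the levels in the index. You then subset it by asking for every value of the second level of the MultiIndex (the one named 'script'). Those level values are not recomputed when you filter with .loc, which is why you still see all 6 scripts.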
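A minimal sketch of that behaviour, assuming a toy frame with made-up rows (only the level names schema and script come from the question). It contrasts .levels with get_level_values, which only looks at the rows that are actually present:

```python
import pandas as pd

# Hypothetical data; only the level names mirror the question.
index = pd.MultiIndex.from_tuples(
    [
        ("schema_10", "cmc_v2_file"),
        ("schema_10", "other_script"),
        ("schema_9", "hcs-vbu"),
    ],
    names=["schema", "script"],
)
df = pd.DataFrame({"step": [10, 4, 14]}, index=index).sort_index()

# Keep only the rows for one script, as in the question.
filtered = df.loc(axis=0)[:, ["cmc_v2_file"]]

# .levels still lists every script from the unfiltered index.
all_scripts = list(filtered.index.levels[filtered.index.names.index("script")])

# get_level_values reads the labels of the remaining rows only.
present_scripts = list(filtered.index.get_level_values("script").unique())

print(all_scripts)
print(present_scripts)
```

For a unit test, checking present_scripts rather than the raw .levels is probably closer to what the question is after.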

I think what you want is something like this, where you ask for one specific value of the index level named 'script'.

import pandas as pd

## here we set the specific value you want to filter with

specific_script_value = 'cmc_v2_file'

## and then we filter on the second level of the index.
## The indexer helps slice along several levels

idx = pd.IndexSlice
df.loc[idx[:, specific_script_value], :]
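If the test really needs to go through .levels, one option is to prune the index after filtering. The following is a sketch under assumptions: the frame is a hypothetical stand-in for the real data, and remove_unused_levels requires a reasonably recent pandas (0.20 or later).

```python
import pandas as pd

# Hypothetical stand-in for the real frame: two schemas, two scripts.
index = pd.MultiIndex.from_product(
    [["schema_10", "schema_9"], ["cmc_v2_file", "hcs-vbu"]],
    names=["schema", "script"],
)
df = pd.DataFrame({"step": [10, 4, 14, 5]}, index=index).sort_index()

# Filter on the 'script' level with IndexSlice.
idx = pd.IndexSlice
subset = df.loc[idx[:, "cmc_v2_file"], :]

# Drop level values that no longer occur, so .levels reflects the subset.
pruned = subset.index.remove_unused_levels()
scripts = list(pruned.levels[pruned.names.index("script")])

print(scripts)
```

Sorting the index first avoids the lexsort warnings or errors that slicing an unsorted MultiIndex can raise.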