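Selecting columns of a pandas DataFrame based on a condition

Time: 2016-08-20 15:41:02

Tags: pandas

I have a DataFrame containing UK election results, with one column per party. The DataFrame looks like this: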

In[107]: Results.columns
Out[107]: 
Index(['Press Association ID Number', 'Constituency Name', 'Region', 'Country',
       'Constituency ID', 'Constituency Type', 'Election Year', 'Electorate',
       ' Total number of valid votes counted ', 'Unnamed: 9',
       ...
       'Wessex Reg', 'Whig', 'Wigan', 'Worth', 'WP', 'WRP', 'WVPTFP', 'Yorks',
       'Young', 'Zeb'],
      dtype='object', length=147)

For example:

Results.head(2)
Out[108]: 
   Press Association ID Number Constituency Name Region Country  \
0                            1          Aberavon  Wales   Wales   
1                            2         Aberconwy  Wales   Wales   

  Constituency ID Constituency Type  Election Year Electorate  \
0       W07000049            County           2015     49,821   
1       W07000058            County           2015     45,525   

   Total number of valid votes counted   Unnamed: 9 ...   Wessex Reg  Whig  \
0                                31,523         NaN ...          NaN   NaN   
1                                30,148         NaN ...          NaN   NaN   

   Wigan  Worth  WP  WRP  WVPTFP  Yorks  Young  Zeb  
0    NaN    NaN NaN  NaN     NaN    NaN    NaN  NaN  
1    NaN    NaN NaN  NaN     NaN    NaN    NaN  NaN  

[2 rows x 147 columns]

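The columns containing the votes for the different parties are Results.ix[:, 'Unnamed: 9':]

Most of these parties received very few votes in any constituency, so I would like to exclude them. Is there a way (without iterating over every row and column myself) to return only those columns that satisfy a particular condition, for example having at least one value > 1000? I would like to be able to specify something like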
    Results.ix[:, 'Unnamed: 9': > 1000]

1 answer:

Answer 0: (score: 1)

You can do it like this:

In [94]: df
Out[94]:
          a         b         c         d           e         f         g           h
0 -1.450976 -1.361099 -0.411566  0.955718   99.882051 -1.166773 -0.468792  100.333169
1  0.049437 -0.169827  0.692466 -1.441196    0.446337 -2.134966 -0.407058   -0.251068
2 -0.084493 -2.145212 -0.634506  0.697951  101.279115 -0.442328 -0.470583   99.392245
3 -1.604788 -1.136284 -0.680803 -0.196149    2.224444 -0.117834 -0.299730   -0.098353
4 -0.751079 -0.732554  1.235118 -0.427149   99.899120  1.742388 -1.636730   99.822745
5  0.955484 -0.261814 -0.272451  1.039296    0.778508 -2.591915 -0.116368   -0.122376
6  0.395136 -1.155138 -0.065242 -0.519787  100.446026  1.584397  0.448349   99.831206
7 -0.691550  0.052180  0.827145  1.531527   -0.240848  1.832925 -0.801922   -0.298888
8 -0.673087 -0.791235 -1.475404  2.232781  101.521333 -0.424294  0.088186   99.553973
9  1.648968 -1.129342 -1.373288 -2.683352    0.598885  0.306705 -1.742007   -0.161067

In [95]: df[df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]]
Out[95]:
            e           h
0   99.882051  100.333169
1    0.446337   -0.251068
2  101.279115   99.392245
3    2.224444   -0.098353
4   99.899120   99.822745
5    0.778508   -0.122376
6  100.446026   99.831206
7   -0.240848   -0.298888
8  101.521333   99.553973
9    0.598885   -0.161067

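Explanation: the comparison produces a boolean DataFrame, .any() reduces it to one boolean per column, and that boolean Series is used to filter the column labels.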

In [96]: (df.loc[:, 'e':] > 50).any()
Out[96]:
e     True
f    False
g    False
h     True
dtype: bool

In [97]: df.loc[:, 'e':].columns
Out[97]: Index(['e', 'f', 'g', 'h'], dtype='object')

In [98]: df.loc[:, 'e':].columns[(df.loc[:, 'e':] > 50).any()]
Out[98]: Index(['e', 'h'], dtype='object')
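The steps above can also be written with the boolean mask passed directly to .loc, which avoids going through the column labels. This is only a sketch under assumptions: the seeded generator and the names sub, mask and selected are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Reproducible stand-in for the random frame used in the answer (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 8)), columns=list("abcdefgh"))
df.loc[::2, ["e", "h"]] += 100  # push columns e and h far above the threshold on every other row

sub = df.loc[:, "e":]        # label slice: column 'e' through the last column
mask = (sub > 50).any()      # one boolean per column: does any row exceed 50?
selected = sub.loc[:, mask]  # keep only the columns where the mask is True

print(list(selected.columns))
```

Both forms should select the same columns; the .loc version simply keeps the slice and the filter in one place.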

Setup (assumes numpy is imported as np and pandas as pd):

In [99]: df = pd.DataFrame(np.random.randn(10, 8), columns=list('abcdefgh'))

In [100]: df.loc[::2, list('eh')] += 100

UPDATE:

As of pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
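Applied to the question, the same idea with .loc instead of .ix might look like the sketch below. The miniature Results frame, its party names and all of its numbers are made up for illustration, and the party columns are assumed to already be numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Results frame; every name and value here is invented.
Results = pd.DataFrame({
    "Constituency Name": ["Seat 1", "Seat 2"],
    "Electorate": [50000, 45000],
    "Unnamed: 9": [np.nan, np.nan],
    "Party A": [15000.0, 8500.0],
    "Party B": [350.0, np.nan],
    "Party C": [np.nan, 1200.0],
})

# Party columns: everything from 'Unnamed: 9' to the end (label slices are inclusive).
parties = Results.loc[:, "Unnamed: 9":]

# Labels of party columns with at least one value above 1000 (NaN compares as False).
keep = parties.columns[(parties > 1000).any()]

# Identifying columns that come before the party block.
meta = Results.columns[: Results.columns.get_loc("Unnamed: 9")]

# Identifying columns followed by the party columns that passed the filter.
filtered = Results[meta.append(keep)]
print(list(filtered.columns))
```

If the vote counts were read as strings such as "31,523", the comparison would fail; passing thousands=',' to pd.read_csv is one way to get numeric columns, assuming the data comes from a CSV file.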