Isolating adjacent columns based on str.contains

Asked: 2017-07-24 21:33:13

Tags: python python-3.x pandas numpy

Hi all, my dataframe looks like this:

 A |  B   |   C | D | E
    'USD'
   'trading expenses-total'   
      8.10   2.3   5.5
      9.1    1.4   6.1
      5.4    5.1   7.8

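I haven't found anything similar, so apologies if this is a duplicate. Basically, I am trying to find the column that contains the string 'total' (column B), together with its two adjacent columns (C and D), and turn them into a dataframe. I feel I'm close with the following code: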

test.loc[:,test.columns.str.contains('total')]

This isolates the correct column, but I can't figure out how to grab the two adjacent columns as well. My desired output is:

 B   |                      C  |  D 
'USD'
'trading expenses-total'   
 8.10                       2.3   5.5
 9.1                        1.4   6.1
 5.4                        5.1   7.8

2 Answers:

Answer 0 (score: 3)

Here's one approach:

from scipy.ndimage import binary_dilation as bind

mask = test.columns.str.contains('total')
test_out = test.iloc[:,bind(mask,[1,1,1],origin=-1)]

If you don't have access to SciPy, you can also use np.convolve, like so:

test_out = test.iloc[:,np.convolve(mask,[1,1,1])[:-2]>0]
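
To see why the convolution also picks up the two columns to the right of each match, here is a small NumPy-only sketch. The helper name `extend_right` and its parameter `n` are made up for illustration and are not part of the answer; it slices with `[:mask.size]` rather than `[:-2]` so that `n=0` also works.

```python
import numpy as np

def extend_right(mask, n=2):
    """Return a boolean mask that is True at every True position of
    `mask` and at the `n` positions to its right (clipped at the end)."""
    mask = np.asarray(mask, dtype=int)
    # A 'full' convolution with n+1 ones adds each element to its n left
    # neighbours, so a single 1 is smeared n positions to the right.
    full = np.convolve(mask, np.ones(n + 1, dtype=int))
    # The full result is n entries longer; keep only the original length.
    return full[:mask.size] > 0

# Same layout as the column names P, total001, g, r, t, total002, k
mask = np.array([False, True, False, False, False, True, False])
print(extend_right(mask))
```

The resulting mask can be passed to `test.iloc[:, ...]` in the same way as the one-liner above.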

Sample run

Case #1:

In [390]: np.random.seed(1234)

In [391]: test = pd.DataFrame(np.random.randint(0,9,(3,5)))

In [392]: test.columns = ['P','total001','g','r','t']

In [393]: test
Out[393]: 
   P  total001  g  r  t
0  3         6  5  4  8
1  1         7  6  8  0
2  5         0  6  2  0

In [394]: mask = test.columns.str.contains('total')

In [395]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[395]: 
   total001  g  r
0         6  5  4
1         7  6  8
2         0  6  2

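Case #2:

This also works if you have multiple matching columns, and if a matching column has fewer than two columns to its right (i.e. the window would run past the last column):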

In [401]: np.random.seed(1234)

In [402]: test = pd.DataFrame(np.random.randint(0,9,(3,7)))

In [403]: test.columns = ['P','total001','g','r','t','total002','k']

In [406]: test
Out[406]: 
   P  total001  g  r  t  total002  k
0  3         6  5  4  8         1  7
1  6         8  0  5  0         6  2
2  0         5  2  6  3         7  0

In [407]: mask = test.columns.str.contains('total')

In [408]: test.iloc[:,bind(mask,[1,1,1],origin=-1)]
Out[408]: 
   total001  g  r  total002  k
0         6  5  4         1  7
1         8  0  5         6  2
2         5  2  6         7  0

Answer 1 (score: 3)

OLD answer:

Pandas approach:

In [36]: df = pd.DataFrame(np.random.rand(3,5), columns=['A','total','C','D','E'])

In [37]: df
Out[37]:
          A     total         C         D         E
0  0.789482  0.427260  0.169065  0.112993  0.142648
1  0.303391  0.484157  0.454579  0.410785  0.827571
2  0.984273  0.001532  0.676777  0.026324  0.094534

In [38]: idx = np.argmax(df.columns.str.contains('total'))

In [39]: df.iloc[:, idx:idx+3]
Out[39]:
      total         C         D
0  0.427260  0.169065  0.112993
1  0.484157  0.454579  0.410785
2  0.001532  0.676777  0.026324
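
One caveat with `np.argmax`: on an all-False mask it returns 0, so when no column name matches, the slice quietly returns the first three columns. Below is a minimal sketch of a guard, assuming you would rather get an error in that case. The function name `first_match_window` and the `width` parameter are hypothetical and not part of the answer.

```python
import numpy as np
import pandas as pd

def first_match_window(df, pattern, width=3):
    """Return the first column whose name contains `pattern`, plus the
    columns after it, `width` columns in total (fewer near the right edge)."""
    mask = np.asarray(df.columns.str.contains(pattern), dtype=bool)
    if not mask.any():
        # np.argmax would return 0 here and select the wrong columns.
        raise KeyError(f"no column name contains {pattern!r}")
    idx = int(np.argmax(mask))  # position of the first True
    return df.iloc[:, idx:idx + width]

df = pd.DataFrame(np.arange(15).reshape(3, 5),
                  columns=['A', 'total', 'C', 'D', 'E'])
print(first_match_window(df, 'total').columns.tolist())
```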

UPDATE:

In [118]: df
Out[118]:
     A                       B    C    D     E
0  NaN                     USD  NaN  NaN   NaN
1  NaN  trading expenses-total  NaN  NaN   NaN
2    A                    8.10  2.3  5.5  10.0
3    B                     9.1  1.4  6.1  11.0
4    C                     5.4  5.1  7.8  12.0

In [119]: col = df.select_dtypes(['object']).apply(lambda x: x.str.contains('total').any()).idxmax()

In [120]: cols = df.columns.to_series().loc[col:].head(3).tolist()

In [121]: col
Out[121]: 'B'

In [122]: cols
Out[122]: ['B', 'C', 'D']

In [123]: df[cols]
Out[123]:
                        B    C    D
0                     USD  NaN  NaN
1  trading expenses-total  NaN  NaN
2                    8.10  2.3  5.5
3                     9.1  1.4  6.1
4                     5.4  5.1  7.8
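
For anyone who wants to reproduce this, here is a self-contained sketch of the same idea. The construction of `df` is an assumption about how the question's data is stored (all of column B held as strings). As a deliberate variation on the session above, it casts every column to string instead of using `select_dtypes`, so the search does not depend on which dtypes pandas inferred, and it selects by position rather than by label.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the frame shown in the session above.
df = pd.DataFrame({
    'A': [np.nan, np.nan, 'A', 'B', 'C'],
    'B': ['USD', 'trading expenses-total', '8.10', '9.1', '5.4'],
    'C': [np.nan, np.nan, 2.3, 1.4, 5.1],
    'D': [np.nan, np.nan, 5.5, 6.1, 7.8],
    'E': [np.nan, np.nan, 10.0, 11.0, 12.0],
})

# True for every column with at least one cell containing 'total';
# na=False makes missing cells count as "no match".
hits = df.astype(str).apply(lambda s: s.str.contains('total', na=False).any())

# Position of the first such column (this sketch assumes one exists),
# then that column plus the next two.
start = int(np.argmax(hits.to_numpy()))
result = df.iloc[:, start:start + 3]
print(result.columns.tolist())
```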