Need to split variable-length data in a pandas DataFrame column into multiple columns

Asked: 2017-10-24 03:27:55

Tags: python pandas dataframe

I have a two-column DataFrame like this:

ITEM        REFNUMS
1   00000299    0036701923024762922029229294652954429569295832...
2   00000655    NaN
24  00001791    00016027123076000158004563065131972
25  00001805    00016027123076000158004563065131972
26  00001813    00016027123076000158004563065131972
27  00001821    00016027123076000158004563065131972
28  00001937    0142530521316303164702509000510012201310027820...

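I want to split the REFNUMS column into 5-character chunks and, if possible, add them to the existing DataFrame, because I need to keep the row index and match each row to its ITEM number. Wherever REFNUMS is not NaN, its length is divisible by 5, so for example row 1 holds 78 groups of 5.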

data_len = (data['REFNUMS'].str.len())/5 

which gives:

0         NaN
1        78.0
2         NaN
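
The divisible-by-5 assumption can be checked before splitting. The following is only a minimal sketch: the small `data` frame is made-up sample data standing in for the real one, and `lengths`, `n_groups`, and `all_divisible` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the real frame.
data = pd.DataFrame(
    {
        "ITEM": ["00000299", "00000655", "00001791"],
        "REFNUMS": ["0036701923", np.nan, "00016027123076000158004563065131972"],
    },
    index=[1, 2, 24],
)

lengths = data["REFNUMS"].str.len()   # NaN where REFNUMS is missing
n_groups = lengths // 5               # number of 5-character groups per row

# True only if every non-missing string has a length divisible by 5.
all_divisible = bool((lengths.dropna() % 5 == 0).all())
```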

Any suggestions on how to do this would be appreciated.

1 Answer:

Answer 0 (score: 1)

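IIUC, you can use str.extractall to pull out the groups of digits, tidy up the resulting columns, and then join them back onto the original DataFrame. Note that the pattern \d{1,5} also matches a final group of fewer than five digits, which is why column 9 in the output below holds a single digit for the two rows whose sample strings end in "...":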

In [168]: r = df.REFNUMS.str.extractall(r"(\d{1,5})").unstack()

In [169]: r.columns = r.columns.droplevel(0)

In [170]: df.join(r)
Out[170]: 
    ITEM                                            REFNUMS      0      1      2      3      4      5      6      7      8     9
1    299  0036701923024762922029229294652954429569295832...  00367  01923  02476  29220  29229  29465  29544  29569  29583     2
2    655                                                NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   NaN
24  1791                00016027123076000158004563065131972  00016  02712  30760  00158  00456  30651  31972   None   None  None
25  1805                00016027123076000158004563065131972  00016  02712  30760  00158  00456  30651  31972   None   None  None
26  1813                00016027123076000158004563065131972  00016  02712  30760  00158  00456  30651  31972   None   None  None
27  1821                00016027123076000158004563065131972  00016  02712  30760  00158  00456  30651  31972   None   None  None
28  1937  0142530521316303164702509000510012201310027820...  01425  30521  31630  31647  02509  00051  00122  01310  02782     0
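
If only complete 5-digit groups are wanted, one possible variation is to use the stricter pattern `\d{5}`. This is a sketch under stated assumptions rather than the answerer's exact code: the small `df` below is made-up sample data, and `out` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the index labels mimic the non-contiguous index in the question.
df = pd.DataFrame(
    {
        "ITEM": ["00000655", "00001791", "00001805"],
        "REFNUMS": [np.nan, "00016027123076000158004563065131972", "0001602712"],
    },
    index=[2, 24, 25],
)

# Match exactly five digits at a time; unstack moves the match number into columns.
r = df["REFNUMS"].str.extractall(r"(\d{5})").unstack()
r.columns = r.columns.droplevel(0)  # keep only the match number as the column label

# Left join on the index: rows with no matches (NaN in REFNUMS) get missing values.
out = df.join(r)
```

With `\d{5}`, a leftover group of fewer than five digits is dropped instead of landing in a column of its own, and the original row index is preserved by the join.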