Regular expressions - using pandas & Python

Date: 2017-11-10 19:26:50

Tags: python regex pandas parsing

I have a column in a pandas DataFrame, df.spec, that contains strings of the following form (three rows as an example):

'PART A TO PART B - 2 features out of tolerance: A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)'

'PART C TO PART B - 1 feature out of tolerance: A14C(dev=-1.8 mm)'

'PART Z-X TO PART C - 1 feature out of tolerance: A25C(dev=-6.2 mm)'

I would like to parse this data into a DataFrame of the following form:

AREA               | POINT | MEASUREMENT
PART A TO PART B   | A12C  | -3.7
PART A TO PART B   | A14D  | -4.1
PART C TO PART B   | A14C  | -1.8
PART Z-X TO PART C | A25C  | -6.2

Can someone help me and explain how to achieve this?

1 answer:

Answer 0: (score: 0)

I would use str.extract.
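As a minimal illustration of what str.extract does with named groups (the two-row Series is made up for this example and is not the asker's data): each named group becomes a column, and a row that does not match comes back as NaN.

```python
import pandas as pd

# Made-up sample: one string that matches and one that does not.
s = pd.Series(["A12C(dev=-3.7 mm)", "no match here"])

# Named groups (?P<name>...) become the column names of the result.
out = s.str.extract(r"(?P<number>\w+)\(dev=(?P<size>-?[\.0-9]+) mm\)", expand=True)
print(out)
```

Note that the extracted values are strings, so size still needs converting if numbers are wanted.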

Note: if you know there are at most N items in a row, you can replace range(1, 2) with range(1, N).

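The last step is to fill in part for the items numbered greater than 0, which are NaN after stacking (see Out[20] in the session below):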

In [11]: s
Out[11]:
0    PART A TO PART B - 2 features out of tolerance...
1    PART C TO PART B - 1 feature out of tolerance:...
2    PART Z-X TO PART C - 1 feature out of toleranc...
dtype: object

In [12]: def chunk(i):
     ...:     return r'(?P<junk_{}>\s(?P<number_{}>.*?)\(dev=(?P<size_{}>-?[\.0-9]+) mm\))'.format(i, i, i)
     ...:

In [13]: df = s.str.extract(r"(?P<part>.*?)\s-.*?:{}?.*?".format(chunk(0) + "?".join((chunk(i) for i in range(1, 2)) )), expand=True)

In [14]: df
Out[14]:
                 part              junk_0 number_0 size_0              junk_1 number_1 size_1
0    PART A TO PART B   A12C(dev=-3.7 mm)     A12C   -3.7   A14D(dev=-4.1 mm)     A14D   -4.1
1    PART C TO PART B   A14C(dev=-1.8 mm)     A14C   -1.8                 NaN      NaN    NaN
2  PART Z-X TO PART C   A25C(dev=-6.2 mm)     A25C   -6.2                 NaN      NaN    NaN

In [15]: df = s.str.extract(r"(?P<part_0>.*?)\s-.*?:{}?.*?".format(chunk(0) + "?".join((chunk(i) for i in range(1, 2)) )), expand=True)

In [16]: df
Out[16]:
               part_0              junk_0 number_0 size_0              junk_1 number_1 size_1
0    PART A TO PART B   A12C(dev=-3.7 mm)     A12C   -3.7   A14D(dev=-4.1 mm)     A14D   -4.1
1    PART C TO PART B   A14C(dev=-1.8 mm)     A14C   -1.8                 NaN      NaN    NaN
2  PART Z-X TO PART C   A25C(dev=-6.2 mm)     A25C   -6.2                 NaN      NaN    NaN

In [17]: df.columns = pd.MultiIndex.from_tuples(df.columns.map(lambda x: tuple(x.split("_"))))

In [18]: df
Out[18]:
                 part                junk number  size                junk number  size
                    0                   0      0     0                   1      1     1
0    PART A TO PART B   A12C(dev=-3.7 mm)   A12C  -3.7   A14D(dev=-4.1 mm)   A14D  -4.1
1    PART C TO PART B   A14C(dev=-1.8 mm)   A14C  -1.8                 NaN    NaN   NaN
2  PART Z-X TO PART C   A25C(dev=-6.2 mm)   A25C  -6.2                 NaN    NaN   NaN

In [19]: df1 = df.stack(level=1)

In [20]: df1
Out[20]:
                   junk number                part  size
0 0   A12C(dev=-3.7 mm)   A12C    PART A TO PART B  -3.7
  1   A14D(dev=-4.1 mm)   A14D                 NaN  -4.1
1 0   A14C(dev=-1.8 mm)   A14C    PART C TO PART B  -1.8
2 0   A25C(dev=-6.2 mm)   A25C  PART Z-X TO PART C  -6.2
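The session stops at the stacked frame, where part is NaN for every item after the first. One way to fill it in is a forward fill. The sketch below assumes a hand-built copy of Out[20] (junk column left out for brevity) rather than rerunning the extraction:

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the stacked frame shown in Out[20].
index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (2, 0)])
df1 = pd.DataFrame(
    {
        "number": ["A12C", "A14D", "A14C", "A25C"],
        "part": ["PART A TO PART B", np.nan, "PART C TO PART B", "PART Z-X TO PART C"],
        "size": ["-3.7", "-4.1", "-1.8", "-6.2"],
    },
    index=index,
)

# part is only present on item 0 of each row, so carry it forward to later items.
df1["part"] = df1["part"].ffill()
print(df1)
```

A plain ffill is enough here because item 0 of every original row always has a part value, so nothing can leak from one row into the next.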

There is a lot going on here; the key thing is to brush up on your regular expressions (if you need to work out what it is doing).
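To unpack the pattern a little, here is a sketch using only the standard re module. build_pattern is a hypothetical helper, not part of the answer. It assumes the first item is required and the following n - 1 items are optional, which mirrors what the format/join expression in In [15] produces for n >= 2.

```python
import re

def chunk(i):
    # One "POINT(dev=VALUE mm)" item; group names carry the suffix i.
    return (r"(?P<junk_{0}>\s(?P<number_{0}>.*?)"
            r"\(dev=(?P<size_{0}>-?[\.0-9]+) mm\))").format(i)

def build_pattern(n):
    # Hypothetical helper: item 0 is required, items 1..n-1 are optional.
    optional = "".join(chunk(i) + "?" for i in range(1, n))
    # part_0: everything up to " -"; then skip ahead to the ":" before the items.
    return r"(?P<part_0>.*?)\s-.*?:" + chunk(0) + optional + r".*?"

line = ("PART A TO PART B - 2 features out of tolerance: "
        "A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)")
m = re.match(build_pattern(3), line)
print(m.group("part_0"), m.group("number_0"), m.group("size_1"), m.group("number_2"))
```

The third item is allowed for but absent in this line, so its groups come back as None; in pandas the same situation shows up as NaN columns.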

Presumably you want to drop the "junk" column!
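A possible final step, assuming df1 has already had part filled in (a hand-built stand-in is used here), is to drop junk, rename the columns to the headings from the question, and convert the measurement to float:

```python
import pandas as pd

# Hand-built stand-in for df1 after part has been filled in.
df1 = pd.DataFrame({
    "junk": [" A12C(dev=-3.7 mm)", " A14D(dev=-4.1 mm)",
             " A14C(dev=-1.8 mm)", " A25C(dev=-6.2 mm)"],
    "number": ["A12C", "A14D", "A14C", "A25C"],
    "part": ["PART A TO PART B", "PART A TO PART B",
             "PART C TO PART B", "PART Z-X TO PART C"],
    "size": ["-3.7", "-4.1", "-1.8", "-6.2"],
})

# Drop the helper column and use the column names from the question.
result = df1.drop(columns="junk").rename(
    columns={"part": "AREA", "number": "POINT", "size": "MEASUREMENT"}
)
# The regex captures text, so convert the measurement to a number.
result["MEASUREMENT"] = result["MEASUREMENT"].astype(float)
# Put the columns in the requested order and use a flat 0..n-1 index.
result = result[["AREA", "POINT", "MEASUREMENT"]].reset_index(drop=True)
print(result)
```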