I have a column in a pandas DataFrame, df.spec, that contains strings of the following form (three example rows):
'PART A TO PART B - 2 features out of tolerance: A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)'
'PART C TO PART B - 1 feature out of tolerance: A14C(dev=-1.8 mm)'
'PART Z-X TO PART C - 1 feature out of tolerance: A25C(dev=-6.2 mm)'
I would like to parse the data into a DataFrame of the following form:
AREA | POINT | MEASUREMENT
PART A TO PART B | A12C | -3.7
PART A TO PART B | A14D | -4.1
PART C TO PART B | A14C | -1.8
PART Z-X TO PART C | A25C | -6.2
Can someone explain how to achieve this?
Answer 0: (score: 0)
I would use str.extract:
Note: if you know there are at most N items in a row, you can replace range(1, 2) with range(1, N).
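To make the note concrete, here is a minimal sketch of how the pattern could be built for an assumed maximum of N items per row. build_pattern is a hypothetical helper that is not part of the answer, and unlike the answer's one-liner it always keeps the first item mandatory. It uses only the standard re module so the captured groups can be inspected without pandas; the same pattern string could be passed to s.str.extract.

```python
import re

def chunk(i):
    # One "NAME(dev=VALUE mm)" item; the group names carry the item index i.
    return r'(?P<junk_{0}>\s(?P<number_{0}>.*?)\(dev=(?P<size_{0}>-?[\.0-9]+) mm\))'.format(i)

def build_pattern(n):
    # Hypothetical helper: n is the assumed maximum number of items per row.
    # chunk(0) is required; chunks 1..n-1 are each made optional with "?".
    optional = "".join(chunk(i) + "?" for i in range(1, n))
    return r"(?P<part_0>.*?)\s-.*?:" + chunk(0) + optional

pattern = re.compile(build_pattern(3))
row = "PART A TO PART B - 2 features out of tolerance: A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)"
m = pattern.match(row)

# Groups for items that are absent (here the third one) come back as None.
print(m.groupdict())
```

With str.extract, the groups that come back as None here would show up as NaN columns instead.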
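Here is the full session. After the final stack, the part value is NaN for the entries whose item index is greater than 0, so it still has to be filled in: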
In [11]: s
Out[11]:
0 PART A TO PART B - 2 features out of tolerance...
1 PART C TO PART B - 1 feature out of tolerance:...
2 PART Z-X TO PART C - 1 feature out of toleranc...
dtype: object
In [12]: def chunk(i):
...: return r'(?P<junk_{}>\s(?P<number_{}>.*?)\(dev=(?P<size_{}>-?[\.0-9]+) mm\))'.format(i, i, i)
...:
In [13]: df = s.str.extract(r"(?P<part>.*?)\s-.*?:{}?.*?".format(chunk(0) + "?".join((chunk(i) for i in range(1, 2)) )), expand=True)
In [14]: df
Out[14]:
part junk_0 number_0 size_0 junk_1 number_1 size_1
0 PART A TO PART B A12C(dev=-3.7 mm) A12C -3.7 A14D(dev=-4.1 mm) A14D -4.1
1 PART C TO PART B A14C(dev=-1.8 mm) A14C -1.8 NaN NaN NaN
2 PART Z-X TO PART C A25C(dev=-6.2 mm) A25C -6.2 NaN NaN NaN
In [15]: df = s.str.extract(r"(?P<part_0>.*?)\s-.*?:{}?.*?".format(chunk(0) + "?".join((chunk(i) for i in range(1, 2)) )), expand=True)
In [16]: df
Out[16]:
part_0 junk_0 number_0 size_0 junk_1 number_1 size_1
0 PART A TO PART B A12C(dev=-3.7 mm) A12C -3.7 A14D(dev=-4.1 mm) A14D -4.1
1 PART C TO PART B A14C(dev=-1.8 mm) A14C -1.8 NaN NaN NaN
2 PART Z-X TO PART C A25C(dev=-6.2 mm) A25C -6.2 NaN NaN NaN
In [17]: df.columns = pd.MultiIndex.from_tuples(df.columns.map(lambda x: tuple(x.split("_"))))
In [18]: df
Out[18]:
part junk number size junk number size
0 0 0 0 1 1 1
0 PART A TO PART B A12C(dev=-3.7 mm) A12C -3.7 A14D(dev=-4.1 mm) A14D -4.1
1 PART C TO PART B A14C(dev=-1.8 mm) A14C -1.8 NaN NaN NaN
2 PART Z-X TO PART C A25C(dev=-6.2 mm) A25C -6.2 NaN NaN NaN
In [19]: df1 = df.stack(level=1)
In [20]: df1
Out[20]:
junk number part size
0 0 A12C(dev=-3.7 mm) A12C PART A TO PART B -3.7
1 A14D(dev=-4.1 mm) A14D NaN -4.1
1 0 A14C(dev=-1.8 mm) A14C PART C TO PART B -1.8
2 0 A25C(dev=-6.2 mm) A25C PART Z-X TO PART C -6.2
There is a lot going on here; the key is to brush up on your regex (if you need to figure this out).
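As an aid for reading the regex, the item part of the pattern can be spelled out piece by piece. The sketch below is an illustration rather than the answer's code: ITEM is a hypothetical name, the group names drop the numeric suffix, and the pattern is applied only to the text after the colon, since on the full string the lazy name group would start matching at the first space.

```python
import re

# Same structure as chunk(i), written with re.VERBOSE so each piece can be commented.
ITEM = re.compile(r"""
    \s                      # the single space that precedes each item
    (?P<number>.*?)         # feature name, matched lazily up to '(dev='
    \(dev=                  # literal '(dev='
    (?P<size>-?[\.0-9]+)    # signed decimal deviation
    [ ]mm\)                 # literal ' mm)'
""", re.VERBOSE)

# Only the part of a row that follows "out of tolerance:".
tail = " A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)"
items = [(m["number"], float(m["size"])) for m in ITEM.finditer(tail)]
print(items)
```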
Presumably, you want to drop the "junk" column!
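One possible way to finish the job is sketched below, assuming at most two items per row as in the sample data. The names wide, long_df and result are hypothetical, and the column names AREA, POINT and MEASUREMENT are taken from the question. The explicit dropna is there because, depending on the pandas version, stack may or may not drop the all-NaN rows by itself.

```python
import pandas as pd

s = pd.Series([
    "PART A TO PART B - 2 features out of tolerance: A12C(dev=-3.7 mm) A14D(dev=-4.1 mm)",
    "PART C TO PART B - 1 feature out of tolerance: A14C(dev=-1.8 mm)",
    "PART Z-X TO PART C - 1 feature out of tolerance: A25C(dev=-6.2 mm)",
])

def chunk(i):
    # One "NAME(dev=VALUE mm)" item; the group names carry the item index i.
    return r'(?P<junk_{0}>\s(?P<number_{0}>.*?)\(dev=(?P<size_{0}>-?[\.0-9]+) mm\))'.format(i)

# First item required, second item optional (assumption: at most two items per row).
pattern = r"(?P<part_0>.*?)\s-.*?:" + chunk(0) + chunk(1) + "?"
wide = s.str.extract(pattern, expand=True)

# Split "name_i" into (name, i) so the item index becomes its own column level.
wide.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in wide.columns])

# One row per (original row, item index).
long_df = wide.stack(level=1).sort_index()

# The part is only captured for item 0, so fill it forward within each original row.
long_df["part"] = long_df.groupby(level=0)["part"].ffill()

# Drop item slots that did not match, then drop the junk column and rename.
long_df = long_df.dropna(subset=["number"])
result = (
    long_df.drop(columns="junk")
    .rename(columns={"part": "AREA", "number": "POINT", "size": "MEASUREMENT"})
    [["AREA", "POINT", "MEASUREMENT"]]
    .reset_index(drop=True)
)
result["MEASUREMENT"] = pd.to_numeric(result["MEASUREMENT"])
print(result)
```

Filling forward per original row (groupby on the first index level) rather than over the whole column avoids copying a part name into a row that failed to match at all.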