Question

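I have a CSV file with a string column. Part of each string is enclosed in parentheses. I want to move the parenthesized part into a separate column and keep the rest of the string.

For example, I want to convert: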

LC(Carbamidomethyl)RLK

into

LCRLK Carbamidomethyl

Answer 1

Regex solution

If the string contains only one parenthesized group, you can use this regular expression:

>>> import re
>>> a = "LC(Carbamidomethyl)RLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK Carbamidomethyl'
>>> a = "LCRLK"
>>> re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)
'LCRLK'  # strings without parentheses are left unchanged

Regex breakdown:

(.*)       #! capture the start of the string
\(         # match the opening parenthesis
  (.+)     #! capture the content inside the parentheses
\)         # match the closing parenthesis
(.*)       #! capture everything after it

---------------
\g<1>\g<3> \g<2>  # write the captures back in the desired order
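
The breakdown above can also be kept inside the pattern itself. The following is a minimal sketch, assuming Python 3 and using `re.VERBOSE`; the names `PATTERN` and `move_group` are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments live in the regex.
PATTERN = re.compile(r"""
    (.*)      # group 1: text before the opening parenthesis
    \(        # literal opening parenthesis
    (.+)      # group 2: text inside the parentheses
    \)        # literal closing parenthesis
    (.*)      # group 3: text after the closing parenthesis
""", re.VERBOSE)


def move_group(text):
    """Return text with the parenthesized part moved to the end (hypothetical helper)."""
    return PATTERN.sub(r"\g<1>\g<3> \g<2>", text)


print(move_group("LC(Carbamidomethyl)RLK"))  # LCRLK Carbamidomethyl
print(move_group("LCRLK"))                   # LCRLK
```

Compiling the pattern once also avoids repeating the pattern string when the substitution runs over many rows.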

String manipulation solution

For huge datasets, a faster solution is:

begin, end = a.find('('), a.find(')')  # -1 means "not found"
if begin != -1 and end != -1:
    # text before + text after + space + text inside the parentheses
    a = a[:begin] + a[end+1:] + " " + a[begin+1:end]

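The idea is to find the positions of the parentheses (if there are any), slice the string at those positions, and then concatenate the pieces in the order we want.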
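
Since the question asks for a separate column, the slicing approach might be applied to a CSV roughly as follows. This is only a sketch: the column names `peptide` and `modification`, the helper `split_parenthesized`, and the in-memory sample data are assumptions made for illustration, not details from the question.

```python
import csv
import io


def split_parenthesized(text):
    """Return (text without the group, group content); content is '' if absent."""
    begin = text.find("(")
    end = text.find(")", begin + 1)
    if begin == -1 or end == -1:
        return text, ""
    return text[:begin] + text[end + 1:], text[begin + 1:end]


# Hypothetical input: a single-column CSV with a header named "peptide".
source = io.StringIO("peptide\nLC(Carbamidomethyl)RLK\nLCRLK\n")
output = io.StringIO()

reader = csv.DictReader(source)
writer = csv.writer(output, lineterminator="\n")
writer.writerow(["peptide", "modification"])
for row in reader:
    # Each row becomes two columns: the cleaned string and the extracted part.
    writer.writerow(split_parenthesized(row["peptide"]))

print(output.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(..., newline="")`.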

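Performance of each method

String manipulation is clearly the faster approach, by roughly a factor of ten in the measurements below (timeit runs each statement 1,000,000 times by default):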

>>> import timeit
>>> timeit.timeit(r"re.sub(r'(.*)\((.+)\)(.*)', r'\g<1>\g<3> \g<2>', a)", setup="a = 'LC(Carbadidomethyl)RLK'; import re")
15.214869976043701

>>> timeit.timeit("begin, end = a.find('('), a.find(')') ; b = a[:begin] + a[end+1:] + ' ' + a[begin+1:end]", setup="a = 'LC(Carbamidomethyl)RL'")
1.44008207321167
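
To repeat the comparison on your own machine, a sketch like the one below may be more convenient. It assumes Python 3.6+, uses the same input string for both approaches, and lowers the run count to 100,000; the function names are illustrative, and the absolute timings will differ from the figures quoted above.

```python
import re
import timeit

SAMPLE = "LC(Carbamidomethyl)RLK"
PATTERN = re.compile(r"(.*)\((.+)\)(.*)")


def with_regex(text):
    """Regex approach: reorder the three captured parts."""
    return PATTERN.sub(r"\g<1>\g<3> \g<2>", text)


def with_slicing(text):
    """String approach: locate the parentheses and slice around them."""
    begin, end = text.find("("), text.find(")")
    if begin != -1 and end != -1:
        return text[:begin] + text[end + 1:] + " " + text[begin + 1:end]
    return text


# Both approaches must agree before their speed is compared.
assert with_regex(SAMPLE) == with_slicing(SAMPLE)

for func in (with_regex, with_slicing):
    seconds = timeit.timeit(lambda: func(SAMPLE), number=100_000)
    print(f"{func.__name__}: {seconds:.3f} s for 100,000 calls")
```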

Multiple parenthesized groups

See the comments.

>>> a = "DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK"
>>> while True:
...     begin, end = a.find('('), a.find(')')
...     if begin != -1 and end != -1:
...         a = a[:begin] + a[end+1:] + " " + a[begin+1:end]
...     else:
...         break
...
>>> a
'DRCKPVNTFVHESLADVQAVCSQKNVACK Carbamidomethyl Carbamidomethyl'
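
As an alternative to the loop, a regex can collect every group in one pass and return the pieces separately, which may map more directly onto separate columns. This is a sketch rather than the answer's own method: it assumes the parentheses are never nested, and `GROUP` and `split_all_groups` are hypothetical names.

```python
import re

# Matches one parenthesized group and captures its content (no nesting assumed).
GROUP = re.compile(r"\(([^()]*)\)")


def split_all_groups(text):
    """Return (text with all groups removed, list of group contents)."""
    contents = GROUP.findall(text)
    stripped = GROUP.sub("", text)
    return stripped, contents


a = "DRC(Carbamidomethyl)KPVNTFVHESLADVQAVC(Carbamidomethyl)SQKNVACK"
sequence, modifications = split_all_groups(a)
print(sequence)       # DRCKPVNTFVHESLADVQAVCSQKNVACK
print(modifications)  # ['Carbamidomethyl', 'Carbamidomethyl']
```

Joining the list with `" ".join(modifications)` gives the same space-separated tail as the loop above.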

Extracting the part of a string in parentheses using Python