Splitting a string on a custom delimiter while respecting and preserving quotes (single or double)

Time: 2019-06-27 13:06:10

Tags: python regex

I have a string like this:

>>> s = '1,",2, ",,4,,,\',7, \',8,,10,'
>>> s
'1,",2, ",,4,,,\',7, \',8,,10,'

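I want to split it on different delimiters (not just white space), and I also want to respect and preserve quotes (single or double).

Expected result when splitting s on the delimiter , :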

['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']

2 answers:

Answer 0 (score: 2)

You seem to be reinventing Python's csv module. Batteries included.

In [1]: import csv
In [2]: s = '1,",2, ",,4,,,\',7, \',8,,10,'
In [3]: next(csv.reader([s]))
Out[3]: ['1', ',2, ', '', '4', '', '', "'", '7', " '", '8', '', '10', '']

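In my opinion, regular expressions are often not a good solution here. They can turn out surprisingly slow at unexpected moments. With the csv module you can adjust the dialect, and it is easy to process any number of strings or files.

I could not get csv to accept both quotechar variants at the same time, but do you really need that?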

In [4]: next(csv.reader([s], quotechar="'"))
Out[4]: ['1', '"', '2', ' "', '', '4', '', '', ',7, ', '8', '', '10', '']

In [5]: s = '1,",2, ",,4,,,",7, ",8,,10,'
In [6]: next(csv.reader([s]))
Out[6]: ['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
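
As a minimal sketch of adjusting the dialect for a custom delimiter: csv.reader accepts delimiter and quotechar keyword arguments. The helper name split_with_csv is hypothetical and only wraps the calls shown above.

```python
import csv

def split_with_csv(text, delimiter=",", quotechar='"'):
    """Split a single line, honoring one kind of quote character."""
    # csv.reader takes an iterable of lines, so wrap the string in a list.
    reader = csv.reader([text], delimiter=delimiter, quotechar=quotechar)
    return next(reader)

fields = split_with_csv('1;"a;b";;3', delimiter=";")
print(fields)  # ['1', 'a;b', '', '3']
```

Note that csv requires both delimiter and quotechar to be single characters, so this does not cover multi-character delimiters or mixed quote styles.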

Answer 1 (score: 0)

A modified version of this (which handles only white space) does the trick (the quotes are removed):

>>> import re
>>> s = '1,",2, ",,4,,,\',7, \',8,,10,'

>>> tokens = [t for t in re.split(r",?\"(.*?)\",?|,?'(.*?)',?|,", s) if t is not None ]
>>> tokens
['1', ',2, ', '', '4', '', '', ',7, ', '8', '', '10', '']
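
The `if t is not None` filter matters because re.split also returns the capture groups of every match, and a group belonging to an alternative that did not take part in the match comes back as None. A short sketch that shows the unfiltered list (the names raw and pattern are only for illustration):

```python
import re

s = '1,",2, ",,4,,,\',7, \',8,,10,'
pattern = r",?\"(.*?)\",?|,?'(.*?)',?|,"

# Each match contributes two entries (one per capture group) to the result.
raw = re.split(pattern, s)
print(raw[:6])  # ['1', ',2, ', None, '', None, None]

# Dropping the None placeholders leaves the fields and the quoted contents.
tokens = [t for t in raw if t is not None]
print(tokens)
```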

If you want to keep the quotes, use this instead:

>>> tokens = [t for t in re.split(r",?(\".*?\"),?|,?('.*?'),?|,", s) if t is not None ]
>>> tokens
['1', '",2, "', '', '4', '', '', "',7, '", '8', '', '10', '']

If you want to use a custom delimiter, replace every occurrence of , in the regular expression with your own delimiter.
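
One way to do that substitution without editing the pattern by hand is to build it from a parameter. This is only a sketch: the function name split_quoted and its keep_quotes flag are hypothetical, and re.escape is added here so that delimiters such as | or . are treated literally.

```python
import re

def split_quoted(text, delimiter=",", keep_quotes=False):
    """Split text on delimiter, treating quoted runs as single fields."""
    d = re.escape(delimiter)  # make regex metacharacters literal
    if keep_quotes:
        dq, sq = r'(".*?")', r"('.*?')"
    else:
        dq, sq = r'"(.*?)"', r"'(.*?)'"
    # Same three alternatives as above, with the delimiter substituted in.
    pattern = rf"(?:{d})?{dq}(?:{d})?|(?:{d})?{sq}(?:{d})?|{d}"
    return [t for t in re.split(pattern, text) if t is not None]

print(split_quoted('a|"b|c"|d', delimiter="|"))     # ['a', 'b|c', 'd']
print(split_quoted('1||"a||b"||||4', delimiter="||"))  # ['1', 'a||b', '', '4']
```

The non-capturing groups (?:...) keep a multi-character delimiter together when it is made optional.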

Explanation

| = match alternatives e.g. ( |X) = space or X
.* = anything
.*? = anything, but as few characters as possible (non-greedy)
x? = x or nothing
() = capture the content of a matched pattern

We have 3 alternatives:

1 "text"    -> ".*?" -> due to escaping rules becomes -> \".*?\"
2 'text'    -> '.*?'
3 delimiter ->  ,

Since we want to capture the content of the text inside the quotes, we use ():

1 \"(.*?)\"   (to keep the quotes use (\".*?\"))
2 '(.*?)'     (to keep the quotes use ('.*?'))

Finally, we don't want the split function to report an empty match when
a delimiter immediately precedes or follows the quotes, so each quoted
alternative also matches (and consumes) that optional delimiter:

1 ,?\"(.*?)\",?
2 ,?'(.*?)',?

Once we use the | operator to join the 3 possibilities we get this regexp:

r",?\"(.*?)\",?|,?'(.*?)',?|,"
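
For readability, the same three alternatives can be written with re.VERBOSE, which ignores white space inside the pattern and allows comments. This is a sketch of the pattern above, not a different technique; the name SPLIT_RE is only for illustration.

```python
import re

SPLIT_RE = re.compile(
    r"""
      ,? " (.*?) " ,?   # 1: double-quoted text and any adjacent delimiter
    | ,? ' (.*?) ' ,?   # 2: single-quoted text and any adjacent delimiter
    | ,                 # 3: a bare delimiter
    """,
    re.VERBOSE,
)

s = '1,",2, ",,4,,,\',7, \',8,,10,'
tokens = [t for t in SPLIT_RE.split(s) if t is not None]
print(tokens)

# Caveat: a quoted field at the very start leaves a leading empty string.
print([t for t in SPLIT_RE.split('"a",b') if t is not None])  # ['', 'a', 'b']
```

As the last line shows, the pattern assumes quoted fields do not sit at the start or end of the string or directly next to another quoted field; in those positions re.split reports an extra empty string.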