I'm splitting a string in Python, and my goal is to split on commas, except for commas that appear inside quotes. I'm currently using
fields = line.strip().split(",")
but some of the strings look like this:
10,20,"Installations, machines",3,5
How can I do this with a regular expression?
Answer 0 (score: 2)
Although I agree that a regex is probably not the best tool for this job, I found the problem quite interesting.
import re
split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall
This regex is made up of four alternatives, tried in this order:
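The four alternatives can be read directly off the pattern. As a sketch, here is the same regex written with re.VERBOSE so that each branch carries a comment. The name SPLIT_RE and the wording of the comments are illustrative additions, not part of the original answer.

```python
import re

# Same pattern as above, one alternative per line. re.VERBOSE makes the
# regex engine ignore the whitespace and the comments inside the pattern.
SPLIT_RE = re.compile(r'''
      [^,]*".*"[^,]*   # 1. a field with a quoted part; commas are allowed between the quotes
    | [^,]+            # 2. a plain, non-empty field with no comma in it
    | (?<=,)           # 3. an empty field right after a comma
    | ^(?=,)           # 4. an empty first field, when the line starts with a comma
''', re.VERBOSE)

split_on_commas = SPLIT_RE.findall

print(split_on_commas('10,,20,"aaa, bbb",3,5'))
# ['10', '', '20', '"aaa, bbb"', '3', '5']
```

The order matters: the quoted alternative has to come first, otherwise the plain-field branch would stop at the comma inside the quotes.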
Some tests:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,20,"aaa, bbb",3,5') == ['10', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,,,20,"aaa, bbb",3,5') == ['10', '', '', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas(',10,20,"aaa, bbb",3,5') == ['', '10', '20', '"aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb",3,5,') == ['10', '20', '"aaa, bbb"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', '"aaa, bbb" ccc', '3', '5']
assert split_on_commas('10,20,ccc "aaa, bbb",3,5') == ['10', '20', 'ccc "aaa, bbb"', '3', '5']
assert split_on_commas('10,20,"aaa, bbb" "ccc",3,5,') == ['10', '20', '"aaa, bbb" "ccc"', '3', '5', '']
assert split_on_commas('10,20,"aaa, bbb" "ccc, ddd",3,5,') == ['10', '20', '"aaa, bbb" "ccc, ddd"', '3', '5', '']
assert split_on_commas('10,20,"aaa, "bbb",3,5') == ['10', '20', '"aaa, "bbb"', '3', '5']
assert split_on_commas('10,20,"",3,5') == ['10', '20', '""', '3', '5']
assert split_on_commas('10,20,",",3,5') == ['10', '20', '","', '3', '5']
assert split_on_commas(',,,') == ['', '', '', '']
assert split_on_commas('') == []
assert split_on_commas(',') == ['', '']
assert split_on_commas('","') == ['","']
assert split_on_commas('",') == ['"', '']
assert split_on_commas(',"') == ['', '"']
assert split_on_commas('"') == ['"']
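One case the tests above do not cover is a line with two separate quoted fields. Because `.*` is greedy, the first alternative runs from the first double quote to the last one on the line, so everything in between ends up in a single field. A small check, reusing the same pattern (the sample input is made up for illustration):

```python
import re

split_on_commas = re.compile(r'[^,]*".*"[^,]*|[^,]+|(?<=,)|^(?=,)').findall

# Two independent quoted fields: the greedy .* spans from the first
# double quote to the last one, so the whole line comes back as one field.
print(split_on_commas('"a,b",c,"d,e"'))  # ['"a,b",c,"d,e"']
```

For lines with at most one quoted field, as in the question, this does not come up.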
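The csv module solution
Similar questions have been asked many times on SO, and each time the best/accepted answer was "just use the csv module". It may be useful to point out some differences between that recommended solution and my re proposition. But first, here is a split function with the same interface, built on csv (not idiomatic, but consistent with the original requirement):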
import csv
split_on_commas = lambda s: next(csv.reader([s]))  # next() works on Python 2.6+ and Python 3
The first thing to note is that csv.reader is more than a smart split. The enclosing quotes are removed:
assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
This leads to some odd behavior:
assert split_on_commas('10,20,"aaa, bbb" ccc,3,5') == ['10', '20', 'aaa, bbb ccc', '3', '5']
assert split_on_commas('10,20,aaa", bbb ccc",3,5') == ['10', '20', 'aaa"', ' bbb ccc"', '3', '5']
I'm sure this is not a problem with properly generated CSV, because the offending double quotes would be escaped.
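To illustrate the point about escaping, here is a minimal round-trip sketch, assuming Python 3 and the csv module's default dialect. The row is the awkward one from the example above. csv.writer doubles the embedded quotes, and csv.reader then recovers the original fields.

```python
import csv
import io

row = ['10', '20', 'aaa", bbb ccc"', '3', '5']

# Write the row with csv.writer; the embedded double quotes get doubled.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line.strip())  # 10,20,"aaa"", bbb ccc""",3,5

# Reading the generated line back yields the original fields.
assert next(csv.reader([line])) == row
```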
More surprising is that, under Python 2, this module still does not support Unicode:
split_on_commas(u'10,20,"Juan, Chô",3,5')
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-83-a0ef82b5fc26> in <module>()
----> 1 split_on_commas(u'10,20,"Juan, Chô",3,5')
<ipython-input-81-18a2b4070348> in <lambda>(s)
1 if __name__ == "__main__":
2 import csv
----> 3 split_on_commas = lambda s: csv.reader([s]).next()
4
5 assert split_on_commas('10,20,"aaa, bbb",3,5') == ['10', '20', 'aaa, bbb', '3', '5']
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf4' in position 15: ordinal not in range(128)
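The traceback above comes from a Python 2 session, where csv works on byte strings. In Python 3 the module operates on str, so the same input should parse without an encoding error. A minimal sketch, assuming Python 3:

```python
import csv

def split_on_commas(s):
    # next() pulls the first (and only) parsed row out of the reader
    return next(csv.reader([s]))

print(split_on_commas('10,20,"Juan, Chô",3,5'))
# ['10', '20', 'Juan, Chô', '3', '5']
```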
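But of course there is a third difference: my solution has not been thoroughly tested, and it is guaranteed not to work in cases I have not thought of... Now, since this approach seems to have several real use cases (e.g. non-TSV files, non-ASCII input), it would be helpful if some regex guru, rather than dismissing it as dangerous, could help identify its limits and improve it.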
Answer 1 (score: 0)
This is how I would do it:
import re
data = "my string \"string is nice\" other string "
print(re.findall(r'(\w+|".*?")', data))
The output is:
['my', 'string', '"string is nice"', 'other', 'string']
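I don't think there is much to explain here, since the regex speaks for itself. Still, if you have any doubts, I recommend regex101.
\w+ - matches one or more word characters [a-zA-Z0-9_]
" - matches the character " literally
.*? - matches any character (except newline), as few times as possible
If you also want to get rid of the square brackets, do this: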
import re
string = "my string \"string is nice\" other string "
parsed_string = re.findall(r'(\w+|".*?")', string)
print(", ".join(parsed_string))
The output is:
my, string, "string is nice", other, string
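The example above splits on whitespace rather than commas. As a hedged sketch of how the same "quoted run or plain run" idea might be applied to the comma-separated line from the question, the pattern below (my adaptation, not the answerer's) matches either a double-quoted run or a run of non-comma characters. It silently drops empty fields and would split a field that mixes plain text with a quoted part, so it only suits simple input.

```python
import re

line = '10,20,"Installations, machines",3,5'

# Either a double-quoted run (commas allowed inside) or a run of non-comma characters.
fields = re.findall(r'"[^"]*"|[^,]+', line)
print(fields)  # ['10', '20', '"Installations, machines"', '3', '5']
```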
Answer 2 (score: 0)
As jonrsharpe and Alan Moore said, Python's built-in CSV module would be a better solution.
From their own example:
import csv
# newline='' lets the csv module handle line endings itself (Python 3)
with open('some.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
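To try this without a file on disk, io.StringIO can stand in for the opened file. This is a small sketch assuming Python 3. The sample line is the one from the question, and the name sample is illustrative.

```python
import csv
import io

# io.StringIO stands in for an opened file; the sample line is from the question.
sample = io.StringIO('10,20,"Installations, machines",3,5\n')
rows = list(csv.reader(sample))

for row in rows:
    print(row)  # ['10', '20', 'Installations, machines', '3', '5']
```

Note that, unlike the regex approach, the quotes around the third field are removed.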
Answer 3 (score: -1)
Regex does not work well here.
You can split on commas and then recombine the pieces... or use the csv module, as suggested in the comments...
line = '10,20,"Installations, machines",3,5'
fields = line.strip().split(",")
result = []
tmpfield = ''  # accumulates the pieces of a quoted field that was split apart
for checkfield in fields:
    tmpfield = checkfield if tmpfield == '' else tmpfield + ',' + checkfield
    if tmpfield.strip().startswith('"'):
        # Quoted field: only emit it once the closing quote has been seen
        if tmpfield.strip().endswith('"'):
            result.append(tmpfield)
            tmpfield = ''
    else:
        result.append(tmpfield)
        tmpfield = ''
# Emit whatever is left if a quoted field was never closed
if tmpfield != '':
    result.append(tmpfield)
print(result)
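The same split-then-rejoin idea can be wrapped in a function so it is easier to test. This is a sketch rather than the answerer's code: the name split_and_rejoin and the None sentinel are my own choices, and it adds one small guard so that a piece consisting of a single `"` is treated as an opening quote only.

```python
def split_and_rejoin(line):
    """Split on commas, then glue back the pieces that belong to one quoted field."""
    result = []
    pending = None  # pieces of a quoted field still waiting for its closing quote
    for piece in line.strip().split(","):
        current = piece if pending is None else pending + "," + piece
        stripped = current.strip()
        opens_quote = stripped.startswith('"')
        # A lone '"' both starts and ends with a quote; require more than one character.
        closes_quote = len(stripped) > 1 and stripped.endswith('"')
        if opens_quote and not closes_quote:
            pending = current
        else:
            result.append(current)
            pending = None
    if pending is not None:
        # Unterminated quote: keep what was collected instead of dropping it
        result.append(pending)
    return result


print(split_and_rejoin('10,20,"Installations, machines",3,5'))
# ['10', '20', '"Installations, machines"', '3', '5']
```

Like the regex in the first answer, this keeps the surrounding quotes in the output, whereas csv.reader removes them.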