Question

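I have this code to split a complex CSV file into chunks. The difficulty is that commas can also appear inside "" "" and must not be split on there. The regex I use to find commas that are not inside "" "" works fine: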

comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')


import re

test = 'Test1,Test2,"",Test3,Test4"",Test5'
comma_re = re.compile(r',(?=([^"]*""[^"]*"")*[^"]*$)')

print(comma_re.split(test))

Output:

['Test1', 'Test2,"",Test3,Test4""', 'Test2', '"",Test3,Test4""', '"",Test3,Test4""', None, 'Test5']

Desired:

['Test1', 'Test2', '"",Test3,Test4""', 'Test5']

How can I avoid the useless split results?

Edit: I didn't even know about the built-in csv module, so I'll go with that. Thanks for your efforts!

Answer 1

(?<!"),(?![^",]+")|,(?=[^"]*$)

This works for the example you provided, but it will not work if the input deviates from that format.

import re

text = 'Test1,Test2,"",Test3,Test4"",Test5'
output = re.split(r'(?<!"),(?![^",]+")|,(?=[^"]*$)', text)
print(output)

# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']

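A note on where the extra entries in the question's output come from: when the pattern contains a capturing group, re.split also returns the text of that group (or None if it did not participate) after each split. A minimal sketch, assuming the question's pattern is otherwise kept as-is and only the group is made non-capturing with (?:...):

```python
import re

# Same pattern as in the question, but with (?:...) instead of (...),
# so re.split has no group text to insert into the result.
comma_re = re.compile(r',(?=(?:[^"]*""[^"]*"")*[^"]*$)')

test = 'Test1,Test2,"",Test3,Test4"",Test5'
parts = comma_re.split(test)
print(parts)
# ['Test1', 'Test2', '"",Test3,Test4""', 'Test5']
```

Like the pattern above, this only holds while the "" markers are balanced on the line.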

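You should really use a CSV parser. If for some reason you can't, just do some manual string processing: go character by character and split whenever you see a comma, unless you have recognized that you are inside a quoted string. Something like this: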

text = 'Test1,Test2,"",Test3,Test4"",Test5'

inside_quoted = False
output = []
last_index = 0
i = 0

while i < len(text):
    if text[i] == ',' and not inside_quoted:
        # Unquoted comma: close the current field.
        output.append(text[last_index:i])
        last_index = i + 1
    elif text.startswith('""', i):
        # A "" pair opens or closes a quoted section; skip both characters.
        inside_quoted = not inside_quoted
        i += 1
    i += 1

# Whatever follows the last split point is the final field.
output.append(text[last_index:])
print(output)

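For the CSV-parser route, here is a minimal sketch using Python's built-in csv module. It assumes the data uses standard CSV quoting with single " characters around fields that contain commas; the sample line is hypothetical and differs from the doubled "" markers in the question's test string, which csv.reader would not treat as field delimiters.

```python
import csv
import io

# Assumed sample: standard CSV quoting, one " on each side of the field.
sample = 'Test1,Test2,"Test3,Test4",Test5\n'

# csv.reader accepts any iterable of lines; StringIO stands in for a file.
rows = list(csv.reader(io.StringIO(sample)))
print(rows)
# [['Test1', 'Test2', 'Test3,Test4', 'Test5']]
```

Note that the parser strips the surrounding quotes from the quoted field, unlike the split-based approaches above.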

Avoid "rest of string" split results
