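We are trying to split data on a delimiter that is passed in from a configuration file. We are running into problems with data that uses single and double quotes in various ways.
Sample input data: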
1|"100001111"|John Payne|100000060
2|'100002222'|John Payne|100000040
3|"100001111|John Payne|100000060
4|100002222"|John Payne|100000040
5|'100001111|John Payne|100000060
6|100002222'|John Payne|100000040
7,100001111,"John,Payne",100000060
8|'100002"222'|John Payne|100000040
9|"100002'222"|John Payne|100000040
10|"100002'222|John Payne|100000040
11|'100002"222|John Payne|100000040
12|100002'222"|John Payne|100000040
13|100002"222'|John Payne|100000040
14,100001111,'John,Payne',100000060
We have tried the following regular expressions, but neither works for every case.
re.split('''[,|](?=(?:[^'"]|'[^']*'|"[^"]*")*$)''' , data)
re.split(r'[ ,|;"]+' , data)
Input
8|'100002"222'|John Payne|100000040
Output
['8' , "'100002"222'" , 'John Payne' , '100000040']
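To make the failure modes concrete, here is a small sketch that runs both patterns from above on two of the sample rows. The constant names (`QUOTE_AWARE`, `CHAR_CLASS`) and variable names are only for illustration; the patterns themselves are unchanged.

```python
import re

# The two patterns from the question, unchanged.
QUOTE_AWARE = r'''[,|](?=(?:[^'"]|'[^']*'|"[^"]*")*$)'''
CHAR_CLASS = r'[ ,|;"]+'

balanced = '1|"100001111"|John Payne|100000060'
unbalanced = '3|"100001111|John Payne|100000060'

# The character class contains a space and a double quote, so the name is
# split in two and the double quotes disappear.
by_char_class = re.split(CHAR_CLASS, balanced)
print(by_char_class)
# ['1', '100001111', 'John', 'Payne', '100000060']

# The lookahead requires every quote after the delimiter to be paired, so the
# first '|' on a row with a stray quote is never treated as a delimiter.
by_quote_aware = re.split(QUOTE_AWARE, unbalanced)
print(by_quote_aware)
# ['3|"100001111', 'John Payne', '100000060']
```

So the second pattern breaks on any field containing a space, and the first one breaks as soon as a row contains a quote with no partner.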
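Answer 0 (score: 1)
Using the csv module creatively, so that each parsed line gets its own delimiter, can solve the problem. It is not perfect, however. Lines with a quote that has no matching closing quote are tricky: by default csv.reader treats only the double quote as a quote character, so single quotes are kept as literal text (rows 2, 8 and 14 in the output), and an opening double quote with no partner swallows the rest of the line (rows 3 and 10).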
import csv
import io
input_data = """
1|"100001111"|John Payne|100000060
2|'100002222'|John Payne|100000040
3|"100001111|John Payne|100000060
4|100002222"|John Payne|100000040
5|'100001111|John Payne|100000060
6|100002222'|John Payne|100000040
7,100001111,"John,Payne",100000060
8|'100002"222'|John Payne|100000040
9|"100002'222"|John Payne|100000040
10|"100002'222|John Payne|100000040
11|'100002"222|John Payne|100000040
12|100002'222"|John Payne|100000040
13|100002"222'|John Payne|100000040
14,100001111,'John,Payne',100000060
""".strip()
parsed_data = []
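for line in input_data.splitlines():
    # Pick the delimiter per line: prefer '|' and fall back to ','.
    sep = '|' if '|' in line else ','
    # csv.reader expects an iterable of lines, so wrap the single line.
    reader = csv.reader(io.StringIO(line), delimiter=sep)
    parsed_line = next(reader)
    parsed_data.append(parsed_line)
    print(parsed_line)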
Output
['1', '100001111', 'John Payne', '100000060']
['2', "'100002222'", 'John Payne', '100000040']
['3', '100001111|John Payne|100000060']
['4', '100002222"', 'John Payne', '100000040']
['5', "'100001111", 'John Payne', '100000060']
['6', "100002222'", 'John Payne', '100000040']
['7', '100001111', 'John,Payne', '100000060']
['8', '\'100002"222\'', 'John Payne', '100000040']
['9', "100002'222", 'John Payne', '100000040']
['10', "100002'222|John Payne|100000040"]
['11', '\'100002"222', 'John Payne', '100000040']
['12', '100002\'222"', 'John Payne', '100000040']
['13', '100002"222\'', 'John Payne', '100000040']
['14', '100001111', "'John", "Payne'", '100000060']
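If the unmatched-quote rows matter, one possible workaround is a small hand-written splitter instead of csv. The sketch below is only a heuristic and not part of the csv module. `split_fields` and `_find_closing` are hypothetical helper names. It assumes that a quote opens a quoted field only when the same quote character appears later, directly before a delimiter or the end of the line. Anything else is kept as literal text. Whether stray quotes should be kept or stripped is not stated in the question, so keeping them is an assumption.

```python
def _find_closing(line, start, quote, sep):
    """Return the index of a closing quote that is directly followed by
    `sep` or by the end of the line, or -1 if there is none."""
    i = line.find(quote, start + 1)
    while i != -1:
        if i + 1 == len(line) or line[i + 1] == sep:
            return i
        i = line.find(quote, i + 1)
    return -1


def split_fields(line, delimiters="|,"):
    """Split one line, treating '...' and "..." as quoted fields only when
    they are properly closed (heuristic, see assumptions above)."""
    # Same per-line choice as the csv version: first candidate found wins.
    sep = next((d for d in delimiters if d in line), delimiters[0])
    fields = []
    pos, n = 0, len(line)
    while pos <= n:
        ch = line[pos] if pos < n else ""
        if ch in ("'", '"'):
            end = _find_closing(line, pos, ch, sep)
            if end != -1:
                fields.append(line[pos + 1:end])  # drop the outer quotes
                pos = end + 2  # skip the closing quote and the delimiter
                continue
        # Unquoted field, or a stray quote that is kept as literal text.
        nxt = line.find(sep, pos)
        if nxt == -1:
            fields.append(line[pos:])
            break
        fields.append(line[pos:nxt])
        pos = nxt + 1
    return fields


for sample in (
    '3|"100001111|John Payne|100000060',
    "8|'100002\"222'|John Payne|100000040",
    "14,100001111,'John,Payne',100000060",
):
    print(split_fields(sample))
```

With this sketch, row 3 splits into four fields with the stray double quote left in the second one, and rows 8 and 14 lose their outer single quotes. The heuristic can still be fooled, for example by an unmatched opening quote followed later in the line by an unrelated quote that happens to sit right before a delimiter.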