Python: splitting data that uses various combinations of single and double quotes

Date: 2019-01-31 12:53:00

Tags: python python-2.7

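We are trying to split data on a delimiter that is passed in from a configuration file. We are running into problems with data that uses single and double quotes in various ways.

Sample input data: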

1|"100001111"|John Payne|100000060
2|'100002222'|John Payne|100000040
3|"100001111|John Payne|100000060
4|100002222"|John Payne|100000040
5|'100001111|John Payne|100000060
6|100002222'|John Payne|100000040
7,100001111,"John,Payne",100000060
8|'100002"222'|John Payne|100000040
9|"100002'222"|John Payne|100000040
10|"100002'222|John Payne|100000040
11|'100002"222|John Payne|100000040
12|100002'222"|John Payne|100000040
13|100002"222'|John Payne|100000040
14,100001111,'John,Payne',100000060

We have tried the following regular expressions, but neither works for all cases.

re.split('''[,|](?=(?:[^'"]|'[^']*'|"[^"]*")*$)''' , data)
re.split(r'[ ,|;"]+' , data)

Input

8|'100002"222'|John Payne|100000040

Output

['8' , "'100002"222'" , 'John Payne' , '100000040']
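
To make the failure modes concrete, here is a small sketch that runs both patterns on three of the sample lines. The constant names QUOTE_AWARE and CHAR_CLASS are only illustrative labels; the patterns themselves are the ones shown above.

```python
import re

# The two patterns from the question, unchanged.
QUOTE_AWARE = r'''[,|](?=(?:[^'"]|'[^']*'|"[^"]*")*$)'''
CHAR_CLASS = r'[ ,|;"]+'

samples = [
    '1|"100001111"|John Payne|100000060',   # balanced double quotes
    '3|"100001111|John Payne|100000060',    # opening quote is never closed
    '7,100001111,"John,Payne",100000060',   # delimiter inside a quoted field
]

for line in samples:
    print(line)
    print('  quote-aware:', re.split(QUOTE_AWARE, line))
    print('  char class: ', re.split(CHAR_CLASS, line))
```

The first pattern only splits on a delimiter when everything after it has balanced quotes, so on line 3 the first pipe is skipped and '3|"100001111' comes back as a single field. The second pattern treats spaces and double quotes as delimiters too, so "John Payne" is split into two fields on every line.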

1 answer:

Answer 0 (score: 1)

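Creative use of the csv module, with a different delimiter chosen for each parsed line, does the trick. It is not perfect, though. Lines containing a single quote with no matching closing quote look tricky.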

import csv
import io

input_data = """
1|"100001111"|John Payne|100000060
2|'100002222'|John Payne|100000040
3|"100001111|John Payne|100000060
4|100002222"|John Payne|100000040
5|'100001111|John Payne|100000060
6|100002222'|John Payne|100000040
7,100001111,"John,Payne",100000060
8|'100002"222'|John Payne|100000040
9|"100002'222"|John Payne|100000040
10|"100002'222|John Payne|100000040
11|'100002"222|John Payne|100000040
12|100002'222"|John Payne|100000040
13|100002"222'|John Payne|100000040
14,100001111,'John,Payne',100000060
""".strip()

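parsed_data = []

for line in input_data.splitlines():
    # Pick the delimiter per line: pipe if the line contains one, otherwise comma.
    sep = '|' if '|' in line else ','
    # csv.reader uses '"' as its only quote character by default, so single
    # quotes are treated as ordinary data.
    # Written for Python 3; on Python 2.7, io.StringIO requires unicode input.
    reader = csv.reader(io.StringIO(line), delimiter=sep)
    parsed_line = next(reader)
    parsed_data.append(parsed_line)
    print(parsed_line)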

Output

['1', '100001111', 'John Payne', '100000060']
['2', "'100002222'", 'John Payne', '100000040']
['3', '100001111|John Payne|100000060']
['4', '100002222"', 'John Payne', '100000040']
['5', "'100001111", 'John Payne', '100000060']
['6', "100002222'", 'John Payne', '100000040']
['7', '100001111', 'John,Payne', '100000060']
['8', '\'100002"222\'', 'John Payne', '100000040']
['9', "100002'222", 'John Payne', '100000040']
['10', "100002'222|John Payne|100000040"]
['11', '\'100002"222', 'John Payne', '100000040']
['12', '100002\'222"', 'John Payne', '100000040']
['13', '100002"222\'', 'John Payne', '100000040']
['14', '100001111', "'John", "Payne'", '100000060']
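
In the output above, lines 3 and 10 lose their remaining fields to an unclosed double quote, and single quotes are never treated as quoting (line 14). One possible way around both, shown here only as a rough sketch and not as part of the csv approach, is a small hand-written splitter. The function name split_fields and its parameters are hypothetical. It assumes a single-character delimiter, and it assumes a quote should only count as quoting when it opens a field and a matching quote closes that field; whether matched quotes should be stripped is not clear from the question, so that is left as a flag.

```python
def split_fields(line, sep, quotes="'\"", strip_quotes=True):
    """Split line on a one-character sep, honouring a quote only when it
    opens a field and the same quote character closes that field."""
    fields = []
    i, n = 0, len(line)
    while i <= n:
        if i < n and line[i] in quotes:
            quote = line[i]
            # Look for the same quote character directly followed by the
            # delimiter or by the end of the line.
            close = line.find(quote, i + 1)
            while close != -1 and close + 1 < n and line[close + 1] != sep:
                close = line.find(quote, close + 1)
            if close != -1:
                if strip_quotes:
                    fields.append(line[i + 1:close])
                else:
                    fields.append(line[i:close + 1])
                i = close + 2  # skip the closing quote and the delimiter
                continue
        # Unquoted field, or an opening quote that is never closed:
        # read up to the next delimiter and keep any quotes as data.
        end = line.find(sep, i)
        if end == -1:
            end = n
        fields.append(line[i:end])
        i = end + 1
    return fields


for line in [
    '3|"100001111|John Payne|100000060',
    "8|'100002\"222'|John Payne|100000040",
    "14,100001111,'John,Payne',100000060",
]:
    sep = '|' if '|' in line else ','
    print(split_fields(line, sep))
```

With these assumptions, line 3 keeps its stray opening quote as part of the second field instead of swallowing the rest of the line, and line 14 yields John,Payne as one field. An input such as 'abc|def'|x remains ambiguous; this sketch treats the whole quoted span as a single field.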