Question

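I am trying to process a large pipe ("|") delimited, double-quote qualified text file (> 700,000 records, each record > 3,000 characters, 28 fields per record) using a Python script. I have run into a problem: the csv parser mis-parses fields because of unescaped double-quote characters and pipes embedded in the field text. Since no tabs exist in the file, I would like to convert it to a tab-delimited file by replacing the double-quote-pipe-double-quote delimiter/qualifier sequence ("|") with tabs (\t). This would be relatively simple if every field were populated, but some are not. Unpopulated fields are represented by empty strings, so a run of 1 to 7 pipe delimiters can appear in sequence, bracketed by double quotes.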

A simple example is:

"abc"|"2016-07-30"|"text narrative field"|"2016-08-01"|"123"|"456"|"789"|"EOR"

A more representative example is:

"abc"|"2017-01-01"|"height: 5' 7" (~180 cm) | weight: 80kg | in good health"|"2016-01-10"||||"EOR"

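I have been trying to write a regex that replaces each double-quote/pipe combination, or each sequence of pipe characters immediately preceded and followed by a double quote, with TAB characters on a 1-for-1 basis (one TAB per pipe). I have found many regex examples that replace a repeated string with a single character, but none that replace a run of repeated characters with an equal-length run of the replacement character.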

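I tried the following regex: "\|{1,}", which works for a single pipe char but replaces a run of multiple pipes with a single TAB. I also need to handle the following related aspects: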

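Remove start-of-line/double-quote (^")
Remove double-quote/end-of-line ("$)
Replace double-quote/pipe (1 or more)/end-of-line (e.g. "\|$) with as many TAB characters as there are pipe symbols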

After applying the regex, the output records should look like this, using \t to represent the TAB character:

abc\t2016-07-30\ttext narrative field\t2016-08-01\t123\t456\t789\tEOR
abc\t2017-01-01\theight: 5' 7" (~180 cm) | weight: 80kg | in good health\t2016-01-10\t\t\t\tEOR

I am open to solving this in Python, or on Linux using sed or awk.

Answer 1

Since you want to replace "|" with tabs, how about first replacing every || with |""|:

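import re

# data holds the text to convert (one record or the whole file) as a string.
# Repeat until stable: one pass turns |||| into |""||""|, which still contains ||.
while True:
    new_data = re.sub(r'\|\|', '|""|', data)
    if data == new_data:
        break
    data = new_data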

Then you can replace "|" with tabs.
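To round this out, here is a minimal sketch of the whole conversion for a single record. The helper name pipes_to_tabs is made up for illustration and is not part of the original answer, and the sketch assumes that neither || nor "|" ever occurs inside the field text itself.

```python
import re

def pipes_to_tabs(record):
    """Convert one quote-qualified, pipe-delimited record to tab-delimited.

    Hypothetical helper; assumes || and "|" never appear inside field text.
    """
    # Fill empty fields: keep replacing || with |""| until nothing changes.
    while True:
        filled = re.sub(r'\|\|', '|""|', record)
        if filled == record:
            break
        record = filled
    # Every delimiter is now the three-character sequence "|".
    record = record.replace('"|"', '\t')
    # Drop the opening and closing qualifier quotes.
    return re.sub(r'^"|"$', '', record)

sample = '"abc"|"2016-01-10"||||"EOR"'
print(repr(pipes_to_tabs(sample)))
```

Records that start or end with an empty field (a bare leading or trailing pipe) are not covered by this sketch and would need extra handling.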

Answer 2

You can do it in 3 passes.

Replace all || with |""|
Split on "|" (and on | at either end)
Strip the quotes from each field.

Like so:

import re

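# file is assumed to be an already-open text file (or any iterable of lines)
for line in file:
    line = line.rstrip('\r\n')  # drop the newline so the closing quote can be stripped

    # fill empty fields; repeat because |||| needs more than one pass
    while '||' in line:
        line = line.replace('||', '|""|')

    fields = re.split(r'^\||\|$|"\|"', line)

    new_line = '\t'.join(field.strip('"') for field in fields)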
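As a rough check of the three passes, the sketch below wraps them in a function and runs it on the representative record from the question. The names split_record and EXPECTED_FIELDS are assumptions for illustration; the sample records have 8 fields, whereas the real file is said to have 28.

```python
import re

EXPECTED_FIELDS = 8  # assumption: matches the 8-field samples; the real file has 28

def split_record(line):
    """Return the field values of one record using the three passes above."""
    line = line.rstrip('\r\n')
    # Pass 1: fill empty fields so every delimiter looks like "|"
    while '||' in line:
        line = line.replace('||', '|""|')
    # Pass 2: split on the delimiter, and on a bare pipe at either end
    fields = re.split(r'^\||\|$|"\|"', line)
    # Pass 3: strip the qualifier quotes left on the fields
    return [field.strip('"') for field in fields]

record = ('"abc"|"2017-01-01"|"height: 5\' 7" (~180 cm) | weight: 80kg '
          '| in good health"|"2016-01-10"||||"EOR"')
fields = split_record(record)
if len(fields) != EXPECTED_FIELDS:
    raise ValueError('unexpected field count: %d' % len(fields))
print('\t'.join(fields))
```

Counting fields per record is a cheap way to spot lines where embedded pipes or quotes confused the split. One caveat: strip('"') removes every leading and trailing quote, so a field whose text legitimately ends in a double quote (such as 5' 7") would lose it.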

Answer 3

import re

def count_pipes_in_regex_match(m):
  #  regex capture group should only contain pipe chars
  matched_pipes = m.groups()[0]

  return '\t' * len(matched_pipes)


# test string
s='"abc"|"2017-01-01"|"height: 5\' 7" (~180 cm) | weight: 80kg | in good health"|"2016-01-10"||||"EOR"'


# replace leading or trailing quotes
s = re.sub('^"|"$', '', s)

# replace quote pipe(s) quote 
# or      quote pipe(s) end-of-string
# with as many tabs as there were pipes
s = re.sub(r'"(\|+)("|$)', count_pipes_in_regex_match, s)

print(repr(s))  # repr to show the tabs

Try it online at repl.it
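For a file of this size it may be preferable to stream line by line rather than hold everything in memory. The following is a sketch only, reusing the two regexes above; the function names convert_line and convert_stream are assumptions, and in-memory streams stand in for the real input and output files.

```python
import io
import re

def _tabs_for_pipes(match):
    # One tab per pipe captured in group 1.
    return '\t' * len(match.group(1))

def convert_line(line):
    """Apply the two substitutions from this answer to a single record."""
    line = line.rstrip('\r\n')
    # Remove the leading and trailing qualifier quotes.
    line = re.sub(r'^"|"$', '', line)
    # quote + pipe(s) + (quote or end of line) -> as many tabs as pipes
    return re.sub(r'"(\|+)("|$)', _tabs_for_pipes, line)

def convert_stream(src, dst):
    """Read records from src and write tab-delimited lines to dst."""
    for line in src:
        dst.write(convert_line(line) + '\n')

# Usage with in-memory streams; files opened with open() work the same way.
sample = io.StringIO('"abc"|"2016-01-10"||||"EOR"\n"a"|"b"||\n')
out = io.StringIO()
convert_stream(sample, out)
print(repr(out.getvalue()))
```

The second sample record ends in bare pipes, which exercises the end-of-line branch of the pattern.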
