Keep quoted chunks intact when splitting by a delimiter

Time: 2018-11-20 11:13:31

Tags: python python-3.x split

Given the example string s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"', I want to split it into the following chunks:

# To Do: something like {l = s.split(',')}
l = ['Hi', 'my name is Humpty-Dumpty', '"Alice, Through the Looking Glass"']

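I don't know in advance where the delimiters will appear or how many there will be.

This is my initial idea. It is long, and it is not accurate: it removes all the delimiters, whereas I want the delimiters inside quotes to survive: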

s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
ss = []
inner_string = ""
delimiter = ','

for item in s.split(delimiter):
    if not inner_string:
        if '"' not in item:  # regular string, not interesting
            ss.append(item)
        else:
            inner_string += item # start inner string

    elif inner_string:
        inner_string += item

        if '\"' in item:  # end inner string
            ss.append(inner_string)
            inner_string = ""
        else:            # middle of inner string
            pass

print(ss)
# prints ['Hi', ' my name is Humpty-Dumpty', ' from "Alice Through the Looking Glass"'] which is OK-ish

3 Answers:

Answer 0 (score: 2)

You can use re.split to split on a regular expression:

>>> import re
>>> [x for x in re.split(r'([^",]*(?:"[^"]*"[^",]*)*)', s) if x not in (',','')]

where s is:

'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'

This outputs:

['Hi', ' my name is Humpty-Dumpty', ' from "Alice, Through the Looking Glass"']

Regex explanation:

(
    [^",]*          zero or more chars other than " or ,
    (?:             non-capturing group
        "[^"]*"     quoted block
        [^",]*      followed by zero or more chars other than " or ,
    )*              zero or more times
)
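
The breakdown above can also live inside the pattern itself by compiling it with re.VERBOSE. The following is only a sketch of that idea: the names FIELD and split_keep_quotes are hypothetical and not part of the answer, and the pattern is the same one used in the re.split call above.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that
# whitespace is ignored and the explanation sits next to each piece.
FIELD = re.compile(r'''
    (
        [^",]*           # zero or more chars other than " or ,
        (?:              # non-capturing group
            "[^"]*"      # quoted block
            [^",]*       # followed by zero or more chars other than " or ,
        )*               # zero or more times
    )
''', re.VERBOSE)


def split_keep_quotes(text):
    # re.split returns the captured fields mixed with the bare commas and
    # empty strings found between them, so those are filtered out here.
    return [x for x in FIELD.split(text) if x not in (',', '')]


s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
print(split_keep_quotes(s))
```

One caveat of the filter: because empty strings are discarded, a genuinely empty field (as in 'a,,b') is dropped as well.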

Answer 1 (score: 1)

I solved this by avoiding split altogether:

s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
l = []
substr = ""
quotes_open = False

for c in s:
    if c == ',' and not quotes_open: # check for comma only if no quotes open
        l.append(substr)
        substr = ""
    elif c == '\"':
        quotes_open = not quotes_open
    else:
        substr += c

l.append(substr)

print(l)

Output:

['Hi', ' my name is Humpty-Dumpty', ' from Alice, Through the Looking Glass']

A more general function might look like this:

def custom_split(input_str, delimiter=' ', avoid_between_char='\"'):
    l = []
    substr = ""
    between_avoid_chars = False
    for c in input_str:  # iterate over the argument, not the global s
        if c == delimiter and not between_avoid_chars:
            l.append(substr)
            substr = ""
        elif c == avoid_between_char:
            between_avoid_chars = not between_avoid_chars
        else:
            substr += c
    l.append(substr)
    return l
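
As a usage sketch, the function can be called with ',' as the delimiter to reproduce the output shown earlier. Note that this approach drops the quote characters, while the question's expected result keeps them. The keep_quotes parameter below is a hypothetical addition, not part of the answer, showing one way to retain them.

```python
def custom_split(input_str, delimiter=' ', avoid_between_char='"', keep_quotes=False):
    parts = []
    current = []      # characters of the chunk being built
    inside = False    # True while between two avoid_between_char characters
    for c in input_str:
        if c == delimiter and not inside:
            # Delimiter outside quotes: close the current chunk
            parts.append(''.join(current))
            current = []
        elif c == avoid_between_char:
            inside = not inside
            if keep_quotes:  # hypothetical option: keep the quote characters
                current.append(c)
        else:
            current.append(c)
    parts.append(''.join(current))
    return parts


s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
print(custom_split(s, ','))
print(custom_split(s, ',', keep_quotes=True))
```

The chunks keep their leading spaces in both cases; calling str.strip on each chunk would remove them if that is wanted.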

Answer 2 (score: 0)

This will work for this specific case and may provide a starting point.

import re
s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'

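# Grab the quoted block (greedy, so this assumes a single quoted block in s)
cut = re.search(r'(".*")', s)

# Mask the quoted block with a placeholder, then split on the delimiter
r = re.sub(r'(".*")', '$VAR$', s).split(',')
res = []
for i in r:
    # Put the quoted block back wherever the placeholder survived the split
    res.append(re.sub(r'\$VAR\$', cut.group(1), i))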

Output:

print(res)
['Hi', ' my name is Humpty-Dumpty', ' from "Alice, Through the Looking Glass"']
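
Since the code above masks everything between the first and last quote as one block, it only fits strings with a single quoted section. A possible way to extend the same placeholder idea to several quoted blocks is sketched below. The function name split_with_placeholders is hypothetical, and the sketch assumes the input never contains the NUL character used as a marker.

```python
import re


def split_with_placeholders(text, delimiter=','):
    quoted = []  # quoted blocks in order of appearance

    def stash(match):
        # Remember the quoted block and leave an indexed marker in its place
        quoted.append(match.group(0))
        return '\x00{}\x00'.format(len(quoted) - 1)

    # Non-overlapping "..." blocks; assumes quotes are balanced and unescaped
    masked = re.sub(r'"[^"]*"', stash, text)

    def restore(match):
        # Look the original quoted block up by the index stored in the marker
        return quoted[int(match.group(1))]

    return [re.sub(r'\x00(\d+)\x00', restore, part)
            for part in masked.split(delimiter)]


s = 'Hi, my name is Humpty-Dumpty, from "Alice, Through the Looking Glass"'
print(split_with_placeholders(s))
print(split_with_placeholders('a,"b,c",d,"e,f"'))
```

Passing functions rather than strings as the replacement argument means any backslashes in the quoted text are reinserted literally instead of being treated as escape sequences.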