Python: parsing a CSV string with possible sets

Date: 2018-04-11 08:25:59

Tags: python csv parsing

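I have a CSV string in which some items may be enclosed in {} and contain commas. I want to collect the string values in a list.

What is the most Pythonic way to collect the values in a list?

Example 1: 'a,b,c', expected output ['a', 'b', 'c']

Example 2: '{aa,ab}, b, c', expected output ['{aa,ab}','b','c']

Example 3: '{aa,ab}, {bb,b}, c', expected output ['{aa,ab}', '{bb,b}', 'c']

I tried s.split(','), which works for example 1 but breaks examples 2 and 3, because it also splits on the commas inside the braces.

I believe this question (How to split but ignore separators in quoted strings, in python?) is very similar to mine, but I can't work out the right regex syntax.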

4 answers:

Answer 0: (score: 6)

The solution is actually very similar:

import re
PATTERN = re.compile(r'''\s*((?:[^,{]|\{[^{]*\})+)\s*''')
data = '{aa,ab}, {bb,b}, c'
print(PATTERN.split(data)[1::2])

which gives:

['{aa,ab}', '{bb,b}', 'c']
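
The pattern above is dense, so here is a sketch of the same expression written with re.VERBOSE and a comment on each part. The names ITEM and split_items are made up for illustration, and the extra strip() is an added assumption to drop trailing spaces that the character class would otherwise capture.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
ITEM = re.compile(r'''
    \s*                 # optional whitespace before an item
    (                   # capture the item itself
      (?:
          [^,{]         # a character that is neither a comma nor an opening brace
        | \{[^{]*\}     # or a whole brace group, commas included
      )+
    )
    \s*                 # optional whitespace after an item
''', re.VERBOSE)


def split_items(text):
    # re.split with one capture group alternates between the text
    # between matches (even indices) and the captured items (odd indices).
    return [item.strip() for item in ITEM.split(text)[1::2]]


for example in ('a,b,c', '{aa,ab}, b, c', '{aa,ab}, {bb,b}, c'):
    print(split_items(example))
```

Note that nested braces such as '{a,{b,c}}' are not handled by this pattern.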

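Answer 1: (score: 3)

A more readable way (to me, at least) is to spell out what you are looking for: either something between braces {} or something made up only of word characters (letters, digits and underscores):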

import re 

examples = [
  'a,b,c',
  '{aa,ab}, b, c',
  '{aa,ab}, {bb,b}, c'
]

for example in examples:
  print(re.findall(r'(\{.+?\}|\w+)', example))

prints

['a', 'b', 'c']
['{aa,ab}', 'b', 'c']
['{aa,ab}', '{bb,b}', 'c']
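
One caveat: \w+ matches only word characters, so an item such as 'foo-bar' or '3.14' would be cut into pieces. A possible variant, assuming unbraced items never contain commas, whitespace or braces, is sketched below. TOKEN and find_items are hypothetical names, not part of the answer.

```python
import re

# Either a brace group, or a run of characters that are not
# commas, whitespace or braces.
TOKEN = re.compile(r'\{.+?\}|[^,\s{}]+')


def find_items(text):
    # With no capture group, findall returns the full matches.
    return TOKEN.findall(text)


print(find_items('{aa,ab}, foo-bar, 3.14'))
```

Unbraced items with internal spaces (for example 'hello world') would still be split in two by this variant.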

Answer 2: (score: 1)

Note that a regex is not necessary; you can simply use plain Python:

s = '{aa,ab}, {bb,b}, c'
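# Indices of the commas that sit outside any {...} group
commas = [i for i, c in enumerate(s)
          if c == ',' and s[:i].count('{') == s[:i].count('}')]
# Slice between consecutive commas; the +2 skips the ', ' separator
[s[a+2:b] for a, b in zip([-2] + commas, commas + [None])]
#['{aa,ab}', '{bb,b}', 'c']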
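
The count-based check rescans the prefix of the string for every comma. As an alternative sketch of the same idea (split only on commas that are outside braces), a single pass with a depth counter could look like the following. The function name split_outside_braces is hypothetical, and the code assumes the braces are balanced.

```python
def split_outside_braces(text):
    items = []
    current = []
    depth = 0  # number of brace groups currently open
    for ch in text:
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
        if ch == ',' and depth == 0:
            # A top-level comma ends the current item.
            items.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    # Whatever follows the last top-level comma is the final item.
    items.append(''.join(current).strip())
    return items


print(split_outside_braces('{aa,ab}, {bb,b}, c'))
```

Unlike the slicing version, this does not depend on each comma being followed by exactly one space, because every item is stripped.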

Answer 3: (score: 0)

A simpler pure-Python approach, with the {} replaced by double quotes:

def parseCSV(string):
    """Split a comma-separated string, honouring double-quoted fields."""
    results = []
    current = ''
    quoted = False   # currently inside a quoted field
    quoting = False  # just saw a '"' inside a quoted field

    for currentletter in string:
        if currentletter == '"':
            if quoted:
                if quoting:
                    # Two quotes in a row inside a field: an escaped quote
                    current += currentletter
                    quoting = False
                else:
                    quoting = True
            else:
                quoted = True
                quoting = False
        else:
            shouldCheck = False
            if quoted:
                if quoting:
                    # The previous '"' closed the quoted field
                    quoted = False
                    quoting = False
                    shouldCheck = True
                else:
                    current += currentletter
            else:
                shouldCheck = True

            if shouldCheck:
                if currentletter == ',':
                    results.append(current)
                    current = ''
                else:
                    current += currentletter

    results.append(current)
    return results
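
Once the items are delimited by double quotes rather than braces, the standard library's csv module can probably do the same job. The sketch below assumes the input is a single line and that a comma may be followed by spaces; split_quoted is a hypothetical name. As with the function above, the quotes themselves are dropped from the output.

```python
import csv


def split_quoted(text):
    # csv.reader expects an iterable of lines, so wrap the string in a list.
    # skipinitialspace=True ignores whitespace right after each comma.
    reader = csv.reader([text], skipinitialspace=True)
    return next(reader)


print(split_quoted('"aa,ab", b, c'))
```

A doubled quote inside a quoted field is read as one literal quote, which is the csv module's default behaviour.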