Reading "CSV" files with Python that are not actually a CSV dialect

Time: 2014-05-30 10:09:55

Tags: python regex csv

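I have some files that, as far as Python's csv module is concerned, do not seem to be CSV files.

The format is not completely different, though. The files just have a few odd properties, for example:

  • Columns are separated by a single space
  • Quoting is done with four space characters
  • There is no way to escape the quoting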
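To illustrate why these properties do not fit the csv module as-is, here is a minimal sketch. The sample line is hypothetical (it is not taken from the real files); it assumes that `baz    escape    bam` is meant to be a single field.

```python
import csv

# Hypothetical sample line: 'baz    escape    bam' is meant to be ONE field.
line = "foo bar baz    escape    bam"

# A single-space delimiter also splits inside the four-space "quotes",
# producing empty fields where the quoting markers were.
naive = next(csv.reader([line], delimiter=" "))
print(naive)

# The csv module only accepts a one-character quotechar, so the
# four-space quoting cannot be declared as a dialect option.
try:
    csv.reader([line], delimiter=" ", quotechar="    ")
    quotechar_error = None
except TypeError as exc:
    quotechar_error = exc
print(quotechar_error)
```

The first call breaks the quoted field into pieces, and the second is rejected outright, which suggests a plain dialect definition is not enough here.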

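In any case, the lines appear to be generated by a regular grammar, so I think parsing them with the re module may be a sensible choice.

My code already uses DictReader for CSV files, but I also need to read these malformed files.

Is there a way to adapt Python's csv module to these files, or should I write a custom class that behaves like DictReader but does not inherit anything from the csv module?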

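1 answer:

Answer 0: (score: 0)

This is more complicated than I expected (at least for a regex solution). There may well be a simpler approach. The following regex seems to work in all cases: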

import re

reobj = re.compile(
    r"""(?:       # Start of non-capturing group:
     [^ \r\n]+    # Match one or more characters except space or newlines
    |             # or
     [ ]{4}       # match a quoted section, starting with four spaces,
     (?:          # then the following non-capturing group
      (?![ ]{4})  # (but only if no four spaces are at the current position)
      .           # which matches any character
     )*           # any number of times,
     [ ]{4}       # ending with four spaces.
    )+            # Repeat that as often as needed (at least one match)
    |             # OR: Match the empty space between two single spaces
    (?<![^ ])     # which means there mustn't be a non-space character before
    (?<!(?<![ ])[ ]{4}) # nor exactly four spaces before the current position,
    (?= |$)       # but there must be a space or the end of the line after it""",
    re.VERBOSE | re.DOTALL)

Result:

>>> reobj.findall("""foo bar baz    escape    bam foo    match with
... embedded newlines       crash boom  bang!""")
['foo', 'bar', 'baz    escape    bam', 'foo    match with\nembedded newlines    ',
 '', '', 'crash', 'boom', '', 'bang!']

See it live on regex101.com
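As for the DictReader part of the question, one option is a small standalone class built on the regex above. The following is only a sketch under stated assumptions: the names `FIELD_RE`, `split_fields` and `SpaceQuotedDictReader` are made up for illustration, each record is assumed to fit on one line (so quoted sections with embedded newlines are not handled), and the first line is assumed to hold the field names unless they are passed in.

```python
import re

# Same pattern as in the answer, repeated so this sketch is self-contained.
FIELD_RE = re.compile(
    r"""(?:
     [^ \r\n]+              # unquoted run of non-space characters
    |
     [ ]{4}                 # opening four-space quote
     (?:(?![ ]{4}).)*       # anything up to the next four spaces
     [ ]{4}                 # closing four-space quote
    )+
    |
    (?<![^ ])               # empty field: no non-space character before,
    (?<!(?<![ ])[ ]{4})     # not directly after exactly four spaces,
    (?= |$)                 # and a space or end of line after it
    """,
    re.VERBOSE | re.DOTALL)


def split_fields(text):
    """Split one record into fields (quoting markers are kept as-is)."""
    return FIELD_RE.findall(text)


class SpaceQuotedDictReader:
    """Hypothetical DictReader-like iterator; assumes one record per line."""

    def __init__(self, lines, fieldnames=None):
        self._lines = iter(lines)
        self.fieldnames = fieldnames

    def __iter__(self):
        return self

    def __next__(self):
        # Like csv.DictReader, take the header from the first line if needed.
        if self.fieldnames is None:
            self.fieldnames = split_fields(next(self._lines).rstrip("\r\n"))
        row = split_fields(next(self._lines).rstrip("\r\n"))
        return dict(zip(self.fieldnames, row))


# Usage with a hypothetical two-line input:
sample = ["id name comment\n", "1 bob note    two words    end\n"]
rows = list(SpaceQuotedDictReader(sample))
print(rows)
```

Because the class only needs an iterable of lines, it can be given an open file object in the same way as DictReader. Note that the four-space markers stay inside the field values; stripping them would be a separate step.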