Question

I have some files that, by the standards of Python's csv module, do not seem to be CSV.

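They are not completely different, though; they just have a few odd properties, such as:

columns are separated by a single space
quoting is done with four space characters
there is no way to escape the quoting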

In any case, the lines appear to be generated by a regular grammar, so I thought parsing them with the re module might be a sound choice.

My code already uses DictReader for CSV files, but I also need to read these malformed files.

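Is there a way to make Python's csv module adapt to these files, or should I create a custom class that resembles DictReader but does not inherit anything from the csv module?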

Answer 1

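This was more complicated than I thought (at least for a regex solution). Perhaps I will be able to think of a simpler one. The following regex seems to work in all cases: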

import re

reobj = re.compile(
    r"""(?:       # Start of non-capturing group:
     [^ \r\n]+    # Match one or more characters except space or newlines
    |             # or
     [ ]{4}       # match a quoted section, starting with four spaces,
     (?:          # then the following non-capturing group
      (?![ ]{4})  # (but only if no four spaces are at the current position)
      .           # which matches any character
     )*           # any number of times,
     [ ]{4}       # ending with four spaces.
    )+            # Repeat that as often as needed (at least one match)
    |             # OR: Match the empty space between two single spaces
    (?<![^ ])     # which means there mustn't be a non-space character before
    (?<!(?<![ ])[ ]{4}) # nor exactly four spaces before the current position,
    (?=[ ]|$)     # but there must be a space or the end of the line after it
                  # (written as [ ] because re.VERBOSE ignores a bare space)""",
    re.VERBOSE | re.DOTALL)

Result:

>>> reobj.findall("""foo bar baz    escape    bam foo    match with
... embedded newlines       crash boom  bang!""")
['foo', 'bar', 'baz    escape    bam', 'foo    match with\nembedded newlines    ',
 '', '', 'crash', 'boom', '', 'bang!']

See it live on regex101.com.
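To connect this back to the DictReader part of the question, here is a minimal sketch of a DictReader-like generator built on the same regex, with no inheritance from the csv module. The name iter_space_quoted_dicts and its fieldnames parameter are hypothetical, made up for illustration. It assumes one record per line (so no newlines embedded in quoted sections), takes the header from the first non-empty line when fieldnames is not given, and leaves the four-space quote delimiters inside the returned values.

```python
import re

# Same pattern as above, written without re.VERBOSE so every space is literal.
FIELD_RE = re.compile(
    r"(?:[^ \r\n]+|[ ]{4}(?:(?![ ]{4}).)*[ ]{4})+"
    r"|(?<![^ ])(?<!(?<![ ])[ ]{4})(?=[ ]|$)",
    re.DOTALL)


def iter_space_quoted_dicts(lines, fieldnames=None):
    """Yield one dict per record, loosely mimicking csv.DictReader.

    Assumption: each record sits on a single line. If fieldnames is None,
    the first non-empty line is used as the header.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        fields = FIELD_RE.findall(line)
        if fieldnames is None:
            fieldnames = fields  # header row
            continue
        # Quote delimiters (four spaces) are kept in the values as-is.
        yield dict(zip(fieldnames, fields))


if __name__ == "__main__":
    sample = ["name city note\n", "alice paris ok\n", "bob  x\n"]
    for row in iter_space_quoted_dicts(sample):
        print(row)
```

Because any iterable of lines works, an open file object can be passed directly, much like with DictReader. Records whose quoted sections span several lines would need to be joined before being handed to this function.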
