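I have some files which, by the standards of Python's csv module, are apparently not CSV files.
They are not completely different, though; they just have a few odd properties.
In any case, the lines appear to be generated by a regular grammar, so I think parsing them with the re module might be a sensible choice.
My code already uses DictReader for CSV files, but I also need to read these malformed files.
Is there a way to adapt Python's csv module to these files, or should I create a custom class that resembles DictReader but does not inherit from anything in the csv module?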
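Answer 0 (score: 0)
This is a lot more complicated than I expected (at least for a regex solution). There may well be a simpler approach. The following regex seems to work in all cases: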
import re

reobj = re.compile(
    r"""(?:                 # Start of non-capturing group:
     [^ \r\n]+              # Match one or more characters except space or newlines
    |                       # or
     [ ]{4}                 # match a quoted section, starting with four spaces,
     (?:                    # then the following non-capturing group
      (?![ ]{4})            # (but only if no four spaces are at the current position)
      .                     # which matches any character
     )*                     # any number of times,
     [ ]{4}                 # ending with four spaces.
    )+                      # Repeat that as often as needed (at least one match)
    |                       # OR: Match the empty space between two single spaces
    (?<![^ ])               # which means there mustn't be a non-space character before
    (?<!(?<![ ])[ ]{4})     # nor exactly four spaces before the current position,
    (?=[ ]|$)               # but there must be a space or the end of the line after it""",
    re.VERBOSE | re.DOTALL)
Result:
>>> reobj.findall("""foo bar baz escape bam foo match with
... embedded newlines crash boom bang!""")
['foo', 'bar', 'baz escape bam', 'foo match with\nembedded newlines ',
'', '', 'crash', 'boom', '', 'bang!']
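
To tie this back to the DictReader part of the question, here is a minimal sketch of a reader that resembles DictReader without inheriting from anything in csv. The class name RegexDictReader is hypothetical, and so is the simplified stand-in pattern used in the demo. It assumes the header row has no embedded newlines and that every record has exactly as many fields as the header. Because findall returns one flat list of fields, the rows are rebuilt by chunking that list.

```python
import re


class RegexDictReader:
    """Hypothetical DictReader-like iterator; does not inherit from csv.

    `pattern` is any compiled regex whose findall() returns one string per field.
    """

    def __init__(self, text, pattern, fieldnames=None):
        if fieldnames is None:
            # Assumption: the header row contains no embedded newlines,
            # so it can be split off as the first physical line.
            header, _, text = text.partition("\n")
            fieldnames = pattern.findall(header)
        self.fieldnames = list(fieldnames)
        if not self.fieldnames:
            raise ValueError("no field names found")
        # Flat list of every field in the remaining text.
        self._fields = pattern.findall(text)

    def __iter__(self):
        width = len(self.fieldnames)
        # Assumption: every record has exactly `width` fields.
        # Incomplete trailing records are silently dropped.
        for start in range(0, len(self._fields) - width + 1, width):
            yield dict(zip(self.fieldnames, self._fields[start:start + width]))


# Stand-in pattern for the demo only: runs of non-space, non-newline characters.
simple = re.compile(r"[^ \r\n]+")
rows = list(RegexDictReader("name age\nalice 30\nbob 25\n", simple))
print(rows)
# [{'name': 'alice', 'age': '30'}, {'name': 'bob', 'age': '25'}]
```

In principle the compiled reobj from the answer could be passed as `pattern` in place of the stand-in, but the matched fields may then still carry their four-space delimiters and would need stripping.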