Question

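I'm new to Python. I'm using the csv reader to parse some files. The data I need to parse uses three different delimiters: comma, pipe, and space (for now).

I have this: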

    import csv
    import re

    with open(filepath, "r") as fp:
        file_lines = fp.readlines()
        # Capture the first non-word character that follows the first word.
        delimiter = re.search(r"\w+([^\w])", file_lines[0]).group(1)
        reader = csv.reader(file_lines, delimiter=delimiter)
        print('Delimiter: [{}]'.format(delimiter))
        line_list = [row for row in reader]
        print(line_list)

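This works for my comma.txt file. However, when my pipe.txt file is passed in, the regex captures the whitespace before it gets to the actual pipe.

Sample input lines look like this:

Pipe: Bouillon | Francis | G | M | Blue | 6-3-1975

Space: Bouillon Francis G M Blue 6-3-1975

Comma: Bouillon, Francis, G, M, Blue, 6-3-1975

Would you recommend a different approach, or should I change my regex?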

Answer 1

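You could try the csv.Sniffer class to determine the dialect of the CSV you are parsing.

The sniff() method takes a string of candidate delimiters, which it uses to work out how to parse the file. It is fairly clever, but the fact that your candidate delimiters include a space, and that the | file also contains spaces, is a problem. If you pass delimiters=',| ' with the space included, it will identify the space as the delimiter of the |-separated file. One option is to try the non-space delimiters first and, if that fails, try the space: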

import csv

with open('test_space.csv') as csvfile:
    try:
        # Try the non-space delimiters first.
        dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')
    except csv.Error:
        # Neither ',' nor '|' fits, so fall back to space.
        csvfile.seek(0)
        dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=' ')
    dialect.skipinitialspace = True
    csvfile.seek(0)

    reader = csv.reader(csvfile, dialect)
    for line in reader:
        print(list(map(str.strip, line)))

This correctly identifies lines like these as space-delimited:

Bou|illon Francis G M Bl,ue 6-3-1975
Bouillon Francis G M Blue 6-3-1975
Bouillon Franc,is G M Blue 6-3-1975

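That would be hard to handle with the regex approach.

However, if one of the candidate delimiters appears on every line, the sniffer will match on it. For example, it parses the following as comma-separated (I am guessing because it sees a comma on every line):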

Bou|illon Francis G M Bl,ue 6-3-1975
Bou,illon Francis G M Blue 6-3-1975
Bouillon Franc,is G M Blue 6-3-1975
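
The fallback described in this answer can also be wrapped in a small helper that works on a text sample instead of an open file. This is only a sketch: the name sniff_delimiter and its preferred/fallback parameters are made up for illustration, and the sample strings are modelled on the question's data.

```python
import csv

def sniff_delimiter(sample, preferred=",|", fallback=" "):
    """Return the delimiter csv.Sniffer picks, trying `preferred` first."""
    sniffer = csv.Sniffer()
    try:
        return sniffer.sniff(sample, delimiters=preferred).delimiter
    except csv.Error:
        # None of the preferred delimiters fit; retry with the fallback.
        return sniffer.sniff(sample, delimiters=fallback).delimiter

pipe_sample = "Bouillon | Francis | G | M | Blue | 6-3-1975\na | b | c | d | f | g\n"
space_sample = "Bouillon Francis G M Blue 6-3-1975\na b c d f g\n"

print(repr(sniff_delimiter(pipe_sample)))
print(repr(sniff_delimiter(space_sample)))
```

The sniffer is heuristic, so it is worth checking the result against a few real files before relying on it.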

Answer 2

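As I said in the comments, the regex works as expected. ;)

With Bouillon | Francis | G | M | Blue | 6-3-1975,

\w+([^\w]) matches 'Bouillon ' as group(0) (the full match), because the space is the first non-word character, so group(1) is that space. ;)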
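
To see this concretely, the question's regex can be run against the three sample lines. This is a small illustrative check; the samples and matches dictionaries are names introduced here, not part of the original code.

```python
import re

samples = {
    "pipe": "Bouillon | Francis | G | M | Blue | 6-3-1975",
    "space": "Bouillon Francis G M Blue 6-3-1975",
    "comma": "Bouillon, Francis, G, M, Blue, 6-3-1975",
}

# Maps each sample name to (full match, captured group).
matches = {}
for name, line in samples.items():
    m = re.search(r"\w+([^\w])", line)
    matches[name] = (m.group(0), m.group(1))
    print(name, repr(m.group(0)), repr(m.group(1)))
```

Only the comma sample yields its real delimiter; for the pipe sample the captured character is the padding space in front of the first pipe.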

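If you want to keep padding spaces in your data, or your data may contain spaces (e.g. Name Surname|Age), you cannot search for the space in the same regex that searches for pipes and commas, because the padding or the space inside the first value would be captured instead.

(Unless you search for multiple characters in that regex, but that needs more complex code, and I like simplicity and readability.)

What you can do is:

Search for pipes and commas (assuming pipe-separated content contains no commas and comma-separated content contains no pipes). Use the space only if that search fails.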

import re

search = re.search(r"[|,]", file_lines[0])  # add other delimiters inside the square brackets
# There are no capturing groups; the full match (group 0) is the first character that is a possible delimiter.
separator = search.group(0) if search else " "  # if the search found nothing, assume space

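Another approach is a hierarchy.
- Assume pipe-separated files can contain anything in their content (including commas, contrary to the first approach, and whitespace)
- Assume comma-separated files can contain anything in their content except pipes
- ...
- Assume space-separated files contain none of the characters used as delimiters

The check then has to be hierarchical too: first check whether a pipe is present. If not, check for a comma. If not, check for ... If not, use the space.

This can be implemented as a simple for loop, and the possible separators can be a plain string ordered from the most important one, "|,". Regex is the wrong tool for something this simple. ;)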

possible_separators = "|,"
separator = " "
for sep in possible_separators:
    if sep in file_lines[0]:
        separator = sep
        break
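
For completeness, the loop above can be turned into a function and combined with csv.reader. This is a sketch under assumptions: pick_separator and parse_lines are hypothetical names, and stripping every field is a choice that only makes sense if padding around the separator is not meaningful data.

```python
import csv

def pick_separator(first_line, candidates="|,", default=" "):
    """Return the first candidate found in first_line, else the default."""
    for sep in candidates:  # the order of candidates encodes their priority
        if sep in first_line:
            return sep
    return default

def parse_lines(lines):
    """Parse an iterable of text lines using the separator of the first one."""
    lines = list(lines)
    sep = pick_separator(lines[0])
    reader = csv.reader(lines, delimiter=sep, skipinitialspace=True)
    # Strip padding left around the separator, e.g. 'Bouillon ' -> 'Bouillon'.
    return [[field.strip() for field in row] for row in reader]

print(parse_lines(["Bouillon | Francis | G | M | Blue | 6-3-1975"]))
print(parse_lines(["Bouillon Francis G M Blue 6-3-1975"]))
```

Because the pipe is checked first, a line such as Name Surname|Age is treated as pipe-separated even though it contains a space.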

Answer 3

Off the top of my head, I would go with something like

([^\w-]|[|]|[,])

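If you trim the result, you get the delimiter. Take a look at RegExr to test it against your file. It uses JavaScript regex, but I find it useful for debugging Python regexes as well.

Edit

@h4z3 correctly pointed out that this can be simplified to: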

([^\w-]|[|,])
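
The answer does not say exactly how the pattern is applied, so the following is only one possible reading of "trim the result": collect every match, strip whitespace, and keep whatever non-empty candidate is left. The function name guess_delimiter and the choice to fall back to a space when there is no single candidate are assumptions made for this sketch.

```python
import re

DELIM_PATTERN = re.compile(r"([^\w-]|[|,])")

def guess_delimiter(line):
    """Return the single non-whitespace candidate found in line, else a space."""
    # Trim every match; whitespace-only matches collapse to '' and are dropped.
    trimmed = {m.strip() for m in DELIM_PATTERN.findall(line)} - {""}
    # Assumption: with no candidate, or several different ones, use a space.
    return trimmed.pop() if len(trimmed) == 1 else " "

print(repr(guess_delimiter("Bouillon | Francis | G | M | Blue | 6-3-1975")))
print(repr(guess_delimiter("Bouillon, Francis, G, M, Blue, 6-3-1975")))
print(repr(guess_delimiter("Bouillon Francis G M Blue 6-3-1975")))
```

The hyphen is excluded by the pattern, so the dashes in 6-3-1975 are never treated as delimiter candidates.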

Answer 4

Two approaches:

(You can also do this without csv.reader, by simply splitting on sep including its surrounding spaces.)

Sample files:

pipe.txt:

Bouillon | Francis | G | M | Blue | 6-3-1975
a | b | c | d | f | g

comma.txt:

Bouillon , Francis , G , M , Blue , 6-3-1975
a , b , c , d , f , g

space.txt:

Bouillon   Francis   G   M   Blue   6-3-1975
a   b   c   d   f   g

import csv
import re
from itertools import chain

with open('pipe.txt') as f:
    line = next(f).strip()   # extract the 1st line
    sep = re.search(r'^\w+([\s\|,]+)', line).group(1)
    # A whitespace-only match means space-separated; otherwise drop the padding.
    sep = ' ' if sep.isspace() else sep.strip()

    # Put the first line back in front of the remaining lines.
    reader = csv.reader(chain(iter([line]), f), delimiter=sep, skipinitialspace=True)
    for row in reader:
        print(row)

Output (for the files comma.txt and pipe.txt):

['Bouillon ', 'Francis ', 'G ', 'M ', 'Blue ', '6-3-1975']
['a ', 'b ', 'c ', 'd ', 'f ', 'g']

with open('space.txt') as f:
...

For space.txt the output is cleaner, thanks to skipinitialspace=True:

['Bouillon', 'Francis', 'G', 'M', 'Blue', '6-3-1975']
['a', 'b', 'c', 'd', 'f', 'g']

Or without csv.reader:

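import re
from itertools import chain

with open('comma.txt') as f:
    line = next(f).strip()   # extract the 1st line
    sep = re.search(r'^\w+([\s\|,]+)', line).group(1)
    # Escape the separator so that '|' is matched literally, not as alternation.
    pat = re.compile(re.escape(sep))

    for row in chain(iter([line]), f):
        print(pat.split(row.strip()))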

Output:

['Bouillon', 'Francis', 'G', 'M', 'Blue', '6-3-1975']
['a', 'b', 'c', 'd', 'f', 'g']
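
One thing to watch with the split-based variant is that the captured separator is then used as a regex pattern. For pipe.txt the captured text is ' | ', and an unescaped | means alternation. The comparison below is an illustrative sketch; the variable names unescaped and escaped are introduced here.

```python
import re

line = "Bouillon | Francis | G | M | Blue | 6-3-1975"
sep = re.search(r"^\w+([\s|,]+)", line).group(1)   # ' | '

# Unescaped, ' | ' reads as "a space OR a space", so the pipes survive as fields.
unescaped = re.split(sep, line)
# Escaped, the three characters of the separator are matched literally.
escaped = re.split(re.escape(sep), line)

print(unescaped)
print(escaped)
```

Passing the separator through re.escape keeps the comma and space cases unchanged and makes the pipe case split as intended.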

Enjoy!

Using regex to capture pipe, comma, and space delimiters
