Question

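I am trying to split a string on multiple delimiters, and I need to keep the delimiters as words of their own. The delimiters I use are all punctuation characters and whitespace.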

For example, the string:

Je suis, FOU et toi ?!

should produce:

'Je'
'suis'
','
'FOU'
'et'
'toi'
'?'
'!'

I wrote:

import re

class Parser :
    def __init__(self) :
        """Empty constructor"""

    def read(self, file_name) :
        from string import punctuation
        with open(file_name, 'r') as file :
            for line in file :
                for word in line.split() :
                    r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
                    print(r.split(word))

But the result I get is:

['Je']
['suis', '']
['FOU']
['et']
['toi']
['', '']

The split seems to be correct, but the resulting lists do not contain the delimiters :(

Answer 1

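You need to put the expression in a capturing group for re.split() to keep it. I would not split on whitespace first; you can always remove the whitespace-only strings afterwards. If you want each punctuation character to be a separate item, apply the + quantifier only to the \s whitespace part of the group: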

# do this just once, not in a loop
pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))

# for each line
parts = [part for part in pattern.split(line) if part.strip()]

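The list comprehension removes anything that contains only whitespace, along with the empty strings that re.split() produces between adjacent delimiters: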

>>> import re
>>> from string import punctuation
>>> line = 'Je suis, FOU et toi ?!'
>>> pattern = re.compile(r'(\s+|[{}])'.format(re.escape(punctuation)))
>>> pattern.split(line)
['Je', ' ', 'suis', ',', '', ' ', 'FOU', ' ', 'et', ' ', 'toi', ' ', '', '?', '', '!', '']
>>> [part for part in pattern.split(line) if part.strip()]
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']

Instead of splitting, you can use re.findall() to find all runs of word characters and all individual punctuation characters:

pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))

parts = pattern.findall(line)

This has the advantage that you do not need to filter out the whitespace:

>>> pattern = re.compile(r'\w+|[{}]'.format(re.escape(punctuation)))
>>> pattern.findall(line)
['Je', 'suis', ',', 'FOU', 'et', 'toi', '?', '!']

Python: split a string and keep the delimiters as words
