Strip whitespace before and after a specific list of punctuation marks

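Time: 2018-09-07 18:13:30

Tags: python

Although I found some references on StackOverflow, I could not write the correct regular expression to achieve my goal. I want to remove the whitespace before and after specific punctuation marks in a string in Python.

My function is below.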

def modify_answers(answers):
    hyp = []
    for ans in answers:
        # remove whitespace before - / ? . ! ;
        newhyp = re.sub(r'\s([-/?.!,;](?:\s|$))', r'\1', ans)
        # remove whitespace after - / $ _
        newhyp = re.sub(r'', r'\1', newhyp)
        hyp.append(newhyp)
    return hyp

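Some examples of what I want to achieve:

  • "Tax pin number is 1 - 866 - 704 - 7388 ." ---> "Tax pin number is 1-866-704-7388."

  • "No , emu is not protected in Victoria ." ---> "No, emu is not protected in Victoria."

  • "Find is to lose as construct is to _ _ _ _ _ _ ." ---> "Find is to lose as construct is to ______."

  • "$ 1,0 is equal to $ 1,0 ." ---> "$1,0 is equal to $1,0."

Any help would be appreciated.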

3 answers:

Answer 0 (score: 4)

First, define a function that performs the replacement:

import re

def replace(x):
    y, z = x.groups()
    if z in '-/?.!,;':
        y = y.lstrip()
    if z in '-/$_':
        y = y.rstrip()
    return y

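The function takes a match object and performs the replacement accordingly: it strips the whitespace before the symbol, after it, or both, depending on which list the symbol belongs to.

Now define your pattern. You can precompile it for efficiency.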

p = re.compile(r'(\s*([-/?.,!$_])\s*)')

Call the compiled regex's sub method on each string, passing the callback defined earlier:

cases = [                               
    "Tax pin number is 1 - 866 - 704 - 7388 .",
    "No , emu is not protected in Victoria .",
    "Find is to lose as construct is to _ _ _ _ _ _ .",
    "$ 1,0 is equal to $ 1,0 ."]

repl = [p.sub(replace, c) for c in cases]

print (repl)
['Tax pin number is 1-866-704-7388.', 'No, emu is not protected in Victoria.', 
 'Find is to lose as construct is to ______.', '$1,0 is equal to $1,0.']
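
To see what the callback receives, the following is a minimal sketch that lists the groups of each match for the same pattern. The sample string is made up for illustration; the first group is the symbol together with its surrounding whitespace, and the second is the bare symbol.

```python
import re

# Same pattern as in the answer above.
p = re.compile(r'(\s*([-/?.,!$_])\s*)')

# Collect (whole match with whitespace, bare symbol) for each match.
groups = [m.groups() for m in p.finditer("No , emu .")]
print(groups)  # [(' , ', ','), (' .', '.')]
```

The callback only has to decide, from the second group, which side of the first group to strip.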

Answer 1 (score: 3)

You can do it like this:

import re

sentences = ["Tax pin number is 1 - 866 - 704 - 7388 .",
             "No , emu is not protected in Victoria .",
             "Find is to lose as construct is to _ _ _ _ _ _ .",
             "$ 1,0 is equal to $ 1,0 ."]


def modify_answers(answers):
    hyp = []
    for ans in answers:
        # remove whitespace before - / ? . ! ;
        new_hyp = re.sub(r'\s([/?.!;_-])(\s|$)', r'\1', ans)
        new_hyp = re.sub(r'\s(,)(\s|$)', r'\1 ', new_hyp)
        new_hyp = re.sub(r'(^|\s)(\$)(\s|$)', r' \2', new_hyp)
        hyp.append(new_hyp.strip())
    return hyp

for sentence in modify_answers(sentences):
    print(sentence)

Output

Tax pin number is 1-866-704-7388.
No, emu is not protected in Victoria.
Find is to lose as construct is to______.
$1,0 is equal to $1,0.

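Notes

  • The first regex replaces any of /?.!;_- surrounded by whitespace with just the symbol. Inside [] the - symbol denotes a range, so it must be placed at the end (or at the start, or be escaped) to be taken literally.
  • The second regex replaces a , surrounded by whitespace with ", " (a comma followed by a space).
  • The third regex replaces a $ surrounded by whitespace with " $" (a dollar sign preceded by a space). In this regex you must reference the second group, because the first group captures the leading whitespace or the start of the string.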
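
The two points in the notes about the hyphen and the group reference can be checked with a short sketch. The sample strings here are made up for illustration and are not from the question.

```python
import re

# Hyphen last in the class: a literal '-' next to '+' and '.'.
literal = re.findall(r'[+.-]', "a+b-c.d,e")

# Hyphen between '+' and '.': the range '+'..'.', which also covers ','.
as_range = re.findall(r'[+-.]', "a+b-c.d,e")

# Group 1 is the leading whitespace (or start of string), group 2 is the '$',
# so the replacement has to refer to \2.
dollar = re.sub(r'(^|\s)(\$)(\s|$)', r' \2', "pay $ 5")

print(literal)   # ['+', '-', '.']
print(as_range)  # ['+', '-', '.', ',']
print(dollar)    # pay $5
```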

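Answer 2 (score: 3)

Use re.sub with the pattern r' (?=[-/?.!])|(?<=[-/$_]) ' to replace each match with an empty string. The lookahead matches a space that is followed by one of -/?.! and the lookbehind matches a space that is preceded by one of -/$_.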

>>> import re
>>> from pprint import pprint
>>> lst = ["Tax pin number is 1 - 866 - 704 - 7388 .",
...              "No , emu is not protected in Victoria .",
...              "Find is to lose as construct is to _ _ _ _ _ _ .",
...              "$ 1,0 is equal to $ 1,0 ."]
>>> 
>>> def modify_answers(answers):
...     ptrn = re.compile(r' (?=[-/?.!])|(?<=[-/$_]) ')
...     return [ptrn.sub('', answer) for answer in answers]
... 
>>> 
>>> pprint(modify_answers(lst))
['Tax pin number is 1-866-704-7388.',
 'No , emu is not protected in Victoria.',
 'Find is to lose as construct is to ______.',
 '$1,0 is equal to $1,0.']
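
The output above leaves "No , emu" unchanged because the comma is not in the lookahead class. If the comma (and the semicolon from the question's list) should be handled too, one possible variant, offered as an assumption rather than as part of the original answer, adds them to the lookahead:

```python
import re

# Assumed variant of the pattern above: ',' and ';' added to the lookahead set.
ptrn = re.compile(r' (?=[-/?.!,;])|(?<=[-/$_]) ')

def modify_answers(answers):
    # Remove a space before - / ? . ! , ; and a space after - / $ _
    return [ptrn.sub('', answer) for answer in answers]

result = modify_answers(["No , emu is not protected in Victoria .",
                         "$ 1,0 is equal to $ 1,0 ."])
print(result)
# ['No, emu is not protected in Victoria.', '$1,0 is equal to $1,0.']
```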