Stripping punctuation with regular expressions - python

Time: 2013-08-25 12:48:11

Tags: python regex

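I need to use a regular expression to strip punctuation from the start and end of a word. It seems like a regex would be the best option for this. I don't want punctuation removed from inside words like "you're", which is why I'm not using .replace(). Thanks in advance =)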

4 answers:

Answer 0: (score: 39)

You don't need a regular expression for this task. Use str.strip together with string.punctuation:

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> '!Hello.'.strip(string.punctuation)
'Hello'

>>> ' '.join(word.strip(string.punctuation) for word in "Hello, world. I'm a boy, you're a girl.".split())
"Hello world I'm a boy you're a girl"

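One edge case worth noting: a token that consists only of punctuation (for example "--") strips down to an empty string, which leaves a double space after joining. Here is a minimal sketch that wraps the same str.strip approach in a function and skips such empty tokens. The function name strip_word_punctuation is hypothetical, not part of the answer above.

```python
import string

def strip_word_punctuation(text, chars=string.punctuation):
    """Strip punctuation from both ends of each whitespace-separated word."""
    stripped = (word.strip(chars) for word in text.split())
    # Tokens made only of punctuation (e.g. "--") strip down to "", so skip them.
    return " ".join(word for word in stripped if word)

print(strip_word_punctuation("Well -- you're right, (mostly)!"))
# Well you're right mostly
```

The apostrophe in "you're" survives because str.strip only touches the two ends of each word.
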
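Answer 1: (score: 0)

You can remove punctuation from a text file, or from a particular string, as follows. (Note that this uses plain string methods rather than a regex.)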

new_data = []
with open('/home/rahul/align.txt', 'r') as f:
    all_words = f.read().split()

# You can add or remove punctuation characters as you see fit.
# A raw string keeps the backslash as a literal character.
punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*_~'''

# To process a string instead of a file, build all_words from it, e.g.
# all_words = "My name$#@ is . rahul -~".split()

for word in all_words:
    # Strip punctuation from both ends of the word and skip tokens
    # that were nothing but punctuation.
    stripped = word.strip(punctuations)
    if stripped:
        new_data.append(stripped)

# Display the words without their surrounding punctuation.
print(new_data)

P.S. - Adjust the indentation as required. Hope this helps!

Answer 2: (score: 0)

I think this function will help remove punctuation:

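import re
def remove_punct(text):
    # text is iterated item by item, so pass a list of words
    # (a plain string would be processed one character at a time).
    new_words = []
    for word in text:
        # Remove everything except word characters and whitespace.
        w = re.sub(r'[^\w\s]', '', word)
        # \w also matches the underscore, so remove it separately.
        w = re.sub(r'_', '', w)
        new_words.append(w)
    return new_words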
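
For reference, the two substitutions can be folded into a single pattern. This is only a sketch of the same idea, and the name remove_punct_words is made up for illustration. Be aware that it removes punctuation everywhere in a word, so the apostrophe in "you're" is lost as well.

```python
import re

# One pattern for both steps: anything that is not a word character or
# whitespace, plus the underscore (which \w would otherwise keep).
_PUNCT = re.compile(r"[^\w\s]|_")

def remove_punct_words(words):
    """Return a new list with all punctuation removed from each word."""
    return [_PUNCT.sub("", word) for word in words]

print(remove_punct_words("Hello, world. snake_case you're".split()))
# ['Hello', 'world', 'snakecase', 'youre']
```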

Answer 3: (score: 0)

If you insist on using a regex, I recommend this solution:

import re
import string
p = re.compile("[" + re.escape(string.punctuation) + "]")
print(p.sub("", "\"hello world!\", he's told me."))
### hello world hes told me

Also note that you can pass your own set of punctuation characters:

my_punct = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '.',
           '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', 
           '`', '{', '|', '}', '~', '»', '«', '“', '”']

punct_pattern = re.compile("[" + re.escape("".join(my_punct)) + "]")
re.sub(punct_pattern, "", "I've been vaccinated against *covid-19*!") # the "-" symbol should remain
### Ive been vaccinated against covid-19
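
Both patterns above remove punctuation anywhere in the text, including the apostrophe in "he's" and "I've". If the goal is to strip punctuation only at the start and end of each word, as the question asks, one possible regex sketch is to anchor the character class to the word boundaries. The names edge_punct and strip_edges are hypothetical, and splitting on whitespace is an assumption about what counts as a word.

```python
import re
import string

punct = re.escape(string.punctuation)
# ^[...]+ matches a run of punctuation at the start of a word,
# [...]+$ matches a run at the end; punctuation in the middle is untouched.
edge_punct = re.compile(r"^[" + punct + r"]+|[" + punct + r"]+$")

def strip_edges(text):
    """Strip leading and trailing punctuation from each whitespace-separated word."""
    words = (edge_punct.sub("", word) for word in text.split())
    # Drop tokens that were nothing but punctuation.
    return " ".join(word for word in words if word)

print(strip_edges("\"hello world!\", he's told me."))
# hello world he's told me
```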