Extracting alphanumeric text that contains some special characters with a Python regex

Asked: 2019-03-27 12:07:33

Tags: python regex python-3.x special-characters

I have the following text, which I want to bring into the desired format using a Python regex:

text = "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'"

I used the following code:

import re
reg = re.compile(r"[^\w']")
text = reg.sub(' ', text)

But it gives the output text = "'PowerPoint PresentationOctober 11th 2011 Visit to Lap Chec1Edit or delete me in â viewâ then â slide masterâ'", which is not the desired output.

The output I want is text = "'PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in view then slide master.'" I want to remove every special character except [ ] ( ) - , .

2 Answers:

Answer 0: (score: 1)

Instead of removing the characters, you can repair them by applying the correct encoding:

text = text.encode('windows-1252').decode('utf-8')
# => ' PowerPoint PresentationOctober 11th, 2011Visit to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'

See the Python demo.
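To see why the round trip works, here is a minimal sketch that first reproduces the garbling and then reverses it. The sample string and variable names are illustrative, and it assumes the text was UTF-8 that got decoded as Windows-1252 somewhere upstream, which may not match every input.

```python
import re

# Illustrative sample; assumes UTF-8 text was wrongly decoded as Windows-1252 upstream.
original = "in ‘view’ then"

# Reproduce the damage: each curly quote becomes three characters starting with "â".
garbled = original.encode("utf-8").decode("windows-1252")

# "â" is a letter, so [^\w'] keeps it while replacing the other two stray characters.
stripped = re.sub(r"[^\w']", " ", garbled)

# Reverse the damage by undoing the wrong decode.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(repaired == original)  # True
```

This also suggests where the leftover "â" in the question's output comes from. If the string contains characters that Windows-1252 cannot represent, the encode step raises UnicodeEncodeError, so the repair is best applied only to text known to be garbled this way.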

If you want to remove them afterwards, that becomes much easier, e.g. text.replace('‘', '').replace('’', '') or re.sub(r'[’‘]+', '', text)
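The two removal variants can be compared side by side. This is a small sketch on a shortened, illustrative sample string; the variable names are not from the original answer.

```python
import re

# Shortened sample with correctly decoded curly quotes.
text = "delete me in ‘view’ then ’slide master’."

# Variant 1: chained str.replace calls, one per quote character.
via_replace = text.replace("‘", "").replace("’", "")

# Variant 2: a single regex character class covering both quote characters.
via_regex = re.sub(r"[’‘]+", "", text)

print(via_replace == via_regex)  # True
```

Both produce the same string here; the regex form scales more easily if further characters need to be dropped.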

Answer 1: (score: -1)

It turned out to be simple, and I found the answer myself. Thanks for the replies.

import re
# Replace everything except word characters, apostrophes, and , . ( ) [ ] with a space
reg = re.compile(r"[^\w',.()\[\]]")
text = reg.sub(' ', text)
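The pattern above does not keep the hyphen, although the question lists it among the characters to preserve, and substituting a space can leave doubled spaces behind. One possible variation, sketched on an illustrative sample string, keeps whitespace and the hyphen in the character class and substitutes the empty string instead. The names `keep` and `cleaned` are hypothetical.

```python
import re

# Illustrative sample containing every punctuation mark the question wants to keep.
text = "October 11th, 2011(Visit) to Lap - in ‘view’ then ’slide master’. [ok]"

# Keep word characters, whitespace, apostrophes, and [ ] ( ) - , .
# The hyphen sits last in the class so it is a literal, not a range.
keep = re.compile(r"[^\w\s',.()\[\]-]")

# Delete everything else instead of replacing it with a space.
cleaned = keep.sub("", text)

print(cleaned == "October 11th, 2011(Visit) to Lap - in view then slide master. [ok]")  # True
```

Whether to delete or to replace with a space depends on the input: deleting can join words that were separated only by a removed character.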