Python str.maketrans: remove punctuation but leave a space in its place

Asked: 2018-10-05 14:42:51

Tags: python string text data-cleaning

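I am using str.maketrans in Python 3, together with the constants in the string module, for some simple text preprocessing: lowercasing, and removing digits and punctuation. The problem is that when I remove punctuation, all the words get glued together with no spaces between them! For example, say I have the following text: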

text='[{"Hello":"List:","Test"321:[{"Hello":"Airplane Towel for Kitchen"},{"Hello":2 " Repair massive utilities "2},{"Hello":"Some 3 appliance for our kitchen"2}'

import string

text = text.lower()
text = text.translate(str.maketrans('', '', string.digits))

This works fine and gives:

'[{"hello":"list:","test":[{"hello":"airplane towel for kitchen"},{"hello": " repair massives utilities "},{"hello":"some  appliance for our kitchen"}'

But as soon as I try to remove the punctuation:

text=text.translate(str.maketrans(' ',' ',string.punctuation))

it gives me this:

'hellolisttesthelloairplane towel for kitchenhello nbsprepair massives utilitiesnbsphellosome  appliance for our kitchen'

Ideally it should produce:

'hello list test hello airplane towel for kitchen hello nbsp repair massives utilities nbsp hello some  appliance for our kitchen'

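I have no particular reason for doing this with maketrans, except that I like how fast and simple it is, and I am stuck on finding a solution with it. Thanks!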

Disclaimer: I already know how to do this with re, as follows:

import re
s = "string.]With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

1 answer:

Answer 0: (score: 1)

Hmm... this works:

txt = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation))).replace(' '*4, ' ').replace(' '*3, ' ').replace(' '*2, ' ').strip()
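
For context on why the words run together in the question: the third argument of str.maketrans lists characters to map to None, so they are deleted outright rather than replaced. The chained replace calls above also only cope with short runs of spaces. A possibly more robust variant, offered as a minimal sketch rather than the definitive fix, maps each punctuation character to a space and then collapses whitespace with split() and join(). The names PUNCT_TO_SPACE and strip_punctuation are made up for this illustration, and the sample string is a shortened version of the text in the question.

```python
import string

# Translation table: every punctuation character becomes a single space
# (instead of being deleted, as the three-argument form of maketrans does).
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    # split() with no arguments splits on any run of whitespace and drops
    # empty pieces, so join() leaves exactly one space between words and
    # no leading or trailing space.
    return " ".join(text.translate(PUNCT_TO_SPACE).split())


# Shortened sample based on the text in the question (hypothetical input).
sample = '[{"hello":"list:","test":[{"hello":"airplane towel for kitchen"}'
print(strip_punctuation(sample))
# prints: hello list test hello airplane towel for kitchen
```

Note that this collapses every run of whitespace to one space, so a double space left behind by removing digits (as in "some  appliance") is normalized too. That may or may not be what you want.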