Python: remove punctuation except apostrophes

Time: 2015-04-28 21:29:55

Tags: python regex unicode punctuation

I found several threads on this topic, and I found this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This should remove all punctuation except ', but the problem is that it also strips everything else in the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

Of course, what I want is to keep the sentence without its punctuation, with "warhol's" left as it is.

Expected output:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit: I also tried using

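import sys
import unicodedata
# Python 2: map every code point in a punctuation category (P*) to None
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)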

But this removes every punctuation mark.

1 answer:

Answer 0: (score: 9)

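Specify all the elements you do not want removed, i.e. \w, \d, \s, etc. This is what the ^ operator means inside square brackets (it matches anything except the listed characters).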
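The negated character class described in this answer can be sketched as follows. This is a minimal illustration, not the answerer's original code: it is written for Python 3 (the question uses Python 2 syntax), and the names KEEP, NOT_KEPT, strip_punctuation and strip_punctuation_translate are made up for the example. As far as I can tell, the pattern in the question fails because the built-in re module does not support \P{P} property classes; Python 2 read [^\P{P}'] as "anything except the literal characters P, {, } and '", which is why only the apostrophe survived. The second function is a variant of the translate() attempt from the question that skips the characters to keep.

```python
import re
import sys
import unicodedata

# Punctuation characters to keep (an assumption taken from the
# expected output in the question: apostrophe and hyphen).
KEEP = "'-"

# Negated class: matches any run of characters that is NOT a word
# character, whitespace, an apostrophe, or a hyphen.
NOT_KEPT = re.compile(r"[^\w\s'-]+")


def strip_punctuation(text):
    """Regex variant: drop everything outside the kept classes."""
    return NOT_KEPT.sub("", text)


# Translation table: every code point in a Unicode punctuation
# category (P*) maps to None, except the characters listed in KEEP.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}


def strip_punctuation_translate(text):
    """translate() variant: drop only category-P characters."""
    return text.translate(PUNCT_TABLE)


sentence = ("warhol's art used many types of media, including hand drawing, "
            "painting, printmaking, photography, silk screening, sculpture, "
            "film, and music.")

print(strip_punctuation(sentence))
print(strip_punctuation("austro-hungarian empire."))
print(strip_punctuation_translate(sentence))
```

The two variants are not equivalent. The regex version also removes symbols such as $ or + and keeps underscores (because \w includes _), while the translate() version removes only characters whose Unicode category starts with P. If real \p{P} support is needed in a pattern, the third-party regex module provides it.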