Question

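I am trying to efficiently strip punctuation from a unicode string. With a regular string, using mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As the comments on this answer explain, the translate method can still be used, but it must be given a dictionary. When I use this implementation, though, I find that the performance of translate drops dramatically. Here is my timing code (copied mostly from this answer):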

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs very badly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform the regex.

Answer 1

The current test script is flawed because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii or all unicode).

Once that is done, the regex and replace methods do much worse with the full set of unicode punctuation characters.
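
To make the difference between the two character sets concrete, here is a minimal sketch of how each can be built. It assumes Python 3 (the scripts in this thread are Python 2), and `strip_with_set` and `sample` are illustrative names that do not appear in the test script. Note that string.punctuation contains a few characters, such as `$` and `+`, that Unicode classifies as symbols rather than punctuation, so the ascii set is not a strict subset of the category 'P' set.

```python
import string
import sys
import unicodedata

# ASCII-only punctuation, as used by the byte-string tests.
ascii_punct = set(string.punctuation)

# Every code point whose Unicode general category starts with 'P'.
# range(sys.maxunicode + 1) also covers the very last code point.
unicode_punct = {
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
}

def strip_with_set(text, exclude):
    """Drop every character of text that is a member of exclude."""
    return ''.join([ch for ch in text if ch not in exclude])

# Hypothetical sample with one curly apostrophe and one long dash.
sample = "Obi Wan’s cantina: a hive—of scum."

print(len(ascii_punct), len(unicode_punct))
print(strip_with_set(sample, ascii_punct))
print(strip_with_set(sample, unicode_punct))
```

The full set runs to several hundred characters (the exact count depends on the Unicode data bundled with the interpreter), which would plausibly explain why a replace loop or a character-class regex built from it has so much more work to do.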

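For the full unicode set, it looks like the "set" method is the best. However, if you only want to remove the ascii punctuation characters from a unicode string, it may be best to encode, translate, and decode (depending on the length of the input string).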
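
As a rough sketch of the encode, translate, and decode route, assuming Python 3 (where bytes.translate accepts None as the table plus a set of bytes to delete): the names `ASCII_PUNCT_BYTES` and `strip_ascii_punct` are made up for this example. The round trip is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so deleting ascii bytes can never damage a non-ascii character.

```python
import string

# The ASCII punctuation characters as a bytes object, built once.
ASCII_PUNCT_BYTES = string.punctuation.encode('ascii')

def strip_ascii_punct(text):
    """Remove ASCII punctuation from a str by working on its UTF-8 bytes."""
    raw = text.encode('utf-8')
    # table=None means "no remapping"; the second argument lists bytes to drop.
    stripped = raw.translate(None, ASCII_PUNCT_BYTES)
    return stripped.decode('utf-8')

print(strip_ascii_punct("Reddit isn’t bad, right?"))
```

The curly apostrophe survives, since only the characters in string.punctuation are removed.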

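The "replace" method can also be improved considerably by doing a containment test before attempting each replacement (depending on the precise make-up of the string).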
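
One possible variation on the containment idea, not taken from the test script and not timed here, is to work out in a single pass which punctuation characters occur in the text and call replace only for those. This sketch assumes Python 3, and `strip_present_only` is a hypothetical name.

```python
import string

def strip_present_only(text, punctuation):
    """Remove punctuation with str.replace, touching only the characters
    that actually occur in text."""
    # One pass over the text finds which punctuation characters are present.
    present = set(text) & set(punctuation)
    for ch in present:
        text = text.replace(ch, '')
    return text

print(strip_present_only("But, you know: it works.", string.punctuation))
```

Whether this beats a per-character `in` test will depend on the length of the text and the size of the punctuation set, so it is worth timing on real input.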

Here are some example results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

Here is the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

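def test_enc_trans(s):
    # Only strips ascii punctuation (string.punctuation), even in the
    # unicode punctuation run, so it is not like-for-like there.
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')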

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')

Fastest way to strip punctuation from a unicode string in Python
