The translate function and unicode conversion

Time: 2016-05-05 06:11:54

Tags: python unicode

I am trying to remove punctuation from the text below. I convert the text to unicode first, to avoid any encoding problems later on.

import string
st = "I absolutely go incredibly far. Zach went fast over crab sand land.\n\nThis is a new paragraph. This is the second sentence in that paragraph. This exsquisite utterance is indubitably the third sentence of this fine text.\n\nPlagiarism detection can be operationalized by decomposing a document into natural sections, such as sentences, chapters, or topically related blocks, and analyzing the variance of stylometric features for these sections. In this regard the decision problems in Sect. 1.2 are of decreasing complexity: instances of AVFIND are comprised of both a selection problem (finding suspicious sections) and an AVOUTLIER problem; instances of AVBATCH are a restricted variant of AVOUTLIER since one has the additional knowledge that all elements of a batch are (or are not) outliers at the same time."
st = unicode(st, errors = 'ignore')
for word in st.split(' '):
    wd = word.lower().translate(string.maketrans("",""), string.punctuation)
    print wd

However, the translate function inexplicably raises an error about the number of arguments.

TypeError: translate() takes exactly one argument (2 given)

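Removing the unicode conversion step makes the code run correctly, but I need to keep both the conversion and the call to translate. How can I achieve my goal without any errors while keeping both?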

2 answers:

Answer 0: (score: 2)

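str.translate() and unicode.translate() take different arguments. This violates the Liskov substitution principle (LSP), but it is necessary given the large number of characters a Unicode string can contain. For a unicode string, pass a single mapping whose keys are code points; mapping a code point to None deletes that character:

word.lower().translate(dict((ord(x), None) for x in string.punctuation))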
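
One detail worth checking with this approach is the type of the dictionary keys: the table is looked up by integer code point, so keys built with ord() take effect, while keys that are one-character strings are never found and the text comes back unchanged. The sketch below illustrates the difference; the names table_by_char, table_by_ord and sample are made up for the illustration, and the u"" literal is used so that the snippet should behave the same on Python 2.7 and Python 3.

```python
import string

# Keys are one-character strings: translate() never finds them, because it
# looks up each character by its integer code point.
table_by_char = dict((c, None) for c in string.punctuation)

# Keys are integer code points: this is the form translate() expects.
table_by_ord = dict((ord(c), None) for c in string.punctuation)

sample = u"crab, sand; land!"
unchanged = sample.translate(table_by_char)  # punctuation is still there
stripped = sample.translate(table_by_ord)    # punctuation is deleted
```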

Answer 1: (score: 2)

That is because you are calling unicode.translate(), not str.translate():

>>> help(unicode.translate)
translate(...)
    S.translate(table) -> unicode

    Return a copy of the string S, where all characters have been mapped
    through the given translation table, which must be a mapping of
    Unicode ordinals to Unicode ordinals, Unicode strings or None.
    Unmapped characters are left untouched. Characters mapped to None
    are deleted.

This should do the same thing, i.e. remove the punctuation characters:

wd = word.lower().translate({ord(c): None for c in string.punctuation})
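
Put together with the loop from the question, this could look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name strip_punctuation and the shortened sample sentence are invented here, the table is built once instead of once per word, and split() without an argument is used so that newlines also separate words (the question splits on single spaces only).

```python
import string

# Build the deletion table once: every ASCII punctuation code point maps
# to None, which tells translate() to drop that character.
PUNCT_TABLE = {ord(c): None for c in string.punctuation}


def strip_punctuation(text):
    """Return the lowercased words of a unicode string, punctuation removed."""
    return [word.lower().translate(PUNCT_TABLE) for word in text.split()]


words = strip_punctuation(u"Zach went fast, over crab-sand land. (Really!)")
for wd in words:
    print(wd)
```

Note that only the characters in string.punctuation are removed; non-ASCII punctuation such as curly quotes would need to be added to the table separately.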

By the way, for str objects you can do this:

wd = word.lower().translate(None, string.punctuation)

That is, when None is given as the translation table, the characters in the second argument are deleted.
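
A small sketch of this two-argument form is shown below. It is written with b"" literals on the assumption that the input is a byte string: on Python 2 that is the plain str type discussed here, and on Python 3 the corresponding type is bytes, whose translate() accepts the same (None, delete) arguments. The names PUNCT_BYTES, raw and cleaned are illustrative only.

```python
import string

# The characters to delete, as a byte string of ASCII punctuation.
PUNCT_BYTES = string.punctuation.encode("ascii")

raw = b"Hello, World! (test)"

# With None as the table no characters are remapped; the bytes listed in
# the second argument are simply removed.
cleaned = raw.lower().translate(None, PUNCT_BYTES)
```

If the text has to end up as unicode anyway, the cleaned byte string can be decoded afterwards, or the single-mapping form can be applied after decoding.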