Question

I want to convert a text file to a string, and I was given this function, written in Python 2, as a starting point.

def parseOutText(f):
    f.seek(0)  
    all_text = f.read()

    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

        words = text_string

        ### split the text string into individual words, stem each word,
        ### and append the stemmed word to words (make sure there's a single
        ### space between each stemmed word)

    return words

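As you can see, I have to add some code to this function, but it does not run (Python raises an error saying that 'string' has no function 'maketrans'). I am sure this code can easily be translated to Python 3, but I don't really understand what it does up to the commented lines. Does it simply strip the punctuation and convert the text to a string?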

Answer 1

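The Python 3.x maketrans and translate have all the basic functionality of their Python 2 predecessors, and more, but they have a different API. So you have to understand what they are doing in order to use them.

In 2.x, translate took a very simple table, built by string.maketrans, plus a separate deletechars list.

In 3.x, the table is more complex (largely because it now translates Unicode characters rather than bytes, but it also has other new features). The table is built by the static method str.maketrans instead of the function string.maketrans. The table also includes the deletion list, so you don't need a separate argument to translate.

From the docs: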

static str.maketrans(x[, y[, z]])

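This static method returns a translation table usable for str.translate().

If there is only one argument, it must be a dictionary mapping Unicode ordinals (integers) or characters (strings of length 1) to Unicode ordinals, strings (of arbitrary length) or None. Character keys will then be converted to ordinals.

If there are two arguments, they must be strings of equal length, and in the resulting dictionary, each character in x will be mapped to the character at the same position in y. If there is a third argument, it must be a string, whose characters will be mapped to None in the result.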
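As a small illustration of the three call forms described in the docs excerpt, here is a sketch. The variable names and sample strings are made up for this example.

```python
# One-argument form: a dict whose keys are characters or ordinals.
# Character keys are converted to ordinals in the resulting table.
table_from_dict = str.maketrans({"a": "A", ord("b"): None, "c": "see"})
dict_result = "abcabc".translate(table_from_dict)

# Two-argument form: each character of x maps to the character
# at the same position in y.
table_swap = str.maketrans("abc", "xyz")
swapped = "aabbcc".translate(table_swap)

# Three-argument form: every character of z maps to None, i.e. is deleted.
table_delete = str.maketrans("", "", "!?")
deleted = "what?! really?".translate(table_delete)

print(table_from_dict)  # {97: 'A', 98: None, 99: 'see'}
print(dict_result)      # AseeAsee
print(swapped)          # xxyyzz
print(deleted)          # what really
```

In every form the result is a plain dict keyed by ordinals, which is what str.translate looks characters up in.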

So, to make a table that deletes all punctuation and does nothing else in 3.x, you do this:

table = str.maketrans('', '', string.punctuation)

And to apply it:

translated = s.translate(table)
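Putting the two lines together, a Python 3 version of the function from the question might look roughly like the sketch below. The name parse_out_text and the sample message are assumptions for illustration, the file is assumed to be opened in text mode, and the stemming step from the original comments is still left out.

```python
import io
import string


def parse_out_text(f):
    """Return the text following the "X-FileName:" marker, minus ASCII punctuation."""
    f.seek(0)  # rewind so the whole file is read
    all_text = f.read()

    # Same split as in the question: element 1 is the text after the first marker.
    content = all_text.split("X-FileName:")
    words = ""
    if len(content) > 1:
        # Python 3 replacement for
        # translate(string.maketrans("", ""), string.punctuation)
        table = str.maketrans("", "", string.punctuation)
        words = content[1].translate(table)

        # Stemming would go here, as in the original comments (not implemented).

    return words


# Hypothetical sample input standing in for a real file object.
sample = io.StringIO("To: someone\nX-FileName: notes.txt\nHi there, how are you?\n")
result = parse_out_text(sample)
print(repr(result))  # ' notestxt\nHi there how are you\n'
```

Note that the dot in the file name is removed as well, since everything after the marker goes through the same table.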

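Meanwhile, since you are dealing with Unicode, are you sure string.punctuation is what you want? As the docs say, this is:

String of ASCII characters which are considered punctuation characters in the C locale.

So, for example, curly quotes, punctuation used in languages other than English, and so on will not be removed.

If that is an issue, you would have to do something like this: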

translated = ''.join(ch for ch in s if unicodedata.category(ch)[0] != 'P')
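To make the difference concrete, here is a short comparison of the two approaches on a sample string with curly quotes and Spanish punctuation. The sample strings and variable names are made up for this sketch.

```python
import string
import unicodedata

# Sample text: “Hola”, ¿qué tal?
s = "\u201cHola\u201d, \u00bfqu\u00e9 tal?"

# ASCII-only table: removes "," and "?" but keeps the curly quotes and "¿".
ascii_only = s.translate(str.maketrans("", "", string.punctuation))

# Category-based filter: drops every character whose Unicode category starts with "P".
unicode_aware = "".join(ch for ch in s if unicodedata.category(ch)[0] != "P")

# Symbols such as "+", "=" and "$" are in string.punctuation, but their
# Unicode categories start with "S", so the category filter keeps them.
symbols = "a+b=$5"
symbols_kept = "".join(ch for ch in symbols if unicodedata.category(ch)[0] != "P")

print(ascii_only)     # “Hola” ¿qué tal
print(unicode_aware)  # Hola qué tal
print(symbols_kept)   # a+b=$5
```

The two filters are therefore not interchangeable: the category-based one catches non-ASCII punctuation but leaves symbols alone, so which one fits depends on the text being processed.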

Answer 2

So I found this code, and it works like a charm:

exclude = set(string.punctuation)
# s is the input text; naming it `string` would shadow the string module
s = ''.join(ch for ch in s if ch not in exclude)

Answer 3

Change this line

text_string = content[1].translate(string.maketrans("", ""), string.punctuation)

to this

text_string = content[1].translate(str.maketrans("", "", string.punctuation))

Converting a text file to a string (Python 3)
