Question

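I am trying to tokenize a sentence into words. In the code below, I try to split the sentence into words using a predefined set of split characters.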

import re
_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")

def basic_tokenizer(sentence):
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(_WORD_SPLIT.split(space_separated_fragment))
    return [w for w in words if w]

basic_tokenizer("I live, in Mumbai.")

It gives me an error:

TypeError: cannot use a bytes pattern on a string-like object

This code used to work for me, but after I reinstalled TensorFlow it started raising this error. I also tried using .decode(), but that did not solve my problem.

I am using Python 3.6 on Ubuntu.

Answer 1

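You compiled the regular expression from a bytes object, but when you call it you pass a str object, space_separated_fragment. Python 3 does not allow mixing the two.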
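To see the mismatch in isolation, here is a minimal sketch (the names bytes_pattern and str_pattern are made up for illustration). A pattern compiled from str only accepts str input, a pattern compiled from bytes only accepts bytes input, and mixing them raises the TypeError from the question:

```python
import re

# Hypothetical names, for illustration only.
bytes_pattern = re.compile(b"[,.]")  # compiled from a bytes object
str_pattern = re.compile("[,.]")     # compiled from a str object

print(str_pattern.split("a,b"))      # str pattern on str input: works
print(bytes_pattern.split(b"a,b"))   # bytes pattern on bytes input: works

try:
    bytes_pattern.split("a,b")       # bytes pattern on str input: fails
except TypeError as exc:
    print("TypeError:", exc)
```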

Convert it to bytes when passing it to _WORD_SPLIT:

import re
_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")

def basic_tokenizer(sentence):
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(_WORD_SPLIT.split(space_separated_fragment.encode()))
    return [w for w in words if w]

basic_tokenizer("I live, in Mumbai.")
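Note that with this approach the returned tokens are bytes objects, not str. If you want strings back, one option is to decode each token after splitting. This is a rough sketch that assumes UTF-8 input; basic_tokenizer_str is a hypothetical name, not part of the original code:

```python
import re

_WORD_SPLIT = re.compile(b"([.,!?\"':;)(])")

def basic_tokenizer_str(sentence):
    """Split with the bytes pattern, then decode each token back to str."""
    words = []
    for space_separated_fragment in sentence.strip().split():
        # Encode so the fragment matches the bytes pattern (UTF-8 assumed).
        words.extend(_WORD_SPLIT.split(space_separated_fragment.encode("utf-8")))
    # Drop the empty tokens produced by re.split and decode the rest.
    return [w.decode("utf-8") for w in words if w]

print(basic_tokenizer_str("I live, in Mumbai."))
```

Splitting UTF-8 bytes on ASCII punctuation is safe because multi-byte UTF-8 sequences never contain ASCII bytes, so each token still decodes cleanly.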

Answer 2

re.compile accepts an ordinary string, so compile the pattern from a str rather than a bytes object:

import re
_WORD_SPLIT = re.compile("([.,!?\"':;)(])")

def basic_tokenizer(sentence):
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(_WORD_SPLIT.split(space_separated_fragment))
    return [w for w in words if w]
print(basic_tokenizer("I live, in Mumbai."))
#['I', 'live', ',', 'in', 'Mumbai', '.']
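If the same tokenizer has to handle both str and bytes input (for example, when some callers read files in binary mode), one possible approach is to keep a pattern of each type and pick one based on the argument. This is only a sketch; tokenize_any and the two pattern names are hypothetical:

```python
import re

_PUNCT = "([.,!?\"':;)(])"
_WORD_SPLIT_STR = re.compile(_PUNCT)                    # for str input
_WORD_SPLIT_BYTES = re.compile(_PUNCT.encode("ascii"))  # for bytes input

def tokenize_any(sentence):
    """Tokenize str or bytes, returning tokens of the same type as the input."""
    pattern = _WORD_SPLIT_BYTES if isinstance(sentence, bytes) else _WORD_SPLIT_STR
    words = []
    for fragment in sentence.strip().split():
        words.extend(pattern.split(fragment))
    return [w for w in words if w]

print(tokenize_any("I live, in Mumbai."))
print(tokenize_any(b"I live, in Mumbai."))
```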
