Question

I came across this function, which is supposed to tokenize a given sentence:

import re

def basic_tokenizer(sentence):
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(re.split(" ", space_separated_fragment))
    return [w for w in words if w]

As far as I can see, sentence.strip().split() should already be enough, but then re.split() is applied as well, and finally the return statement uses [w for w in words if w].

What could be the reason for this? An explanation of these three different steps, with examples, would be appreciated.

Answer 1

The whole function can be shortened to:

def basic_tokenizer(sentence):
    return sentence.split()

Why:

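sentence.strip().split() already strips the surrounding whitespace and splits on whitespace, so there is no need to iterate over the resulting list and extend the words list by splitting on spaces again (words.extend(re.split(" ", space_separated_fragment))). In fact, split() with no arguments already ignores leading and trailing whitespace, so even the strip() call is unnecessary.
Also, the if w check in [w for w in words if w] is redundant, because the list cannot contain any falsy elements (every element is a non-empty string).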
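To make the three steps concrete, here is a small sketch. The sample strings and the name short_tokenizer are made up for illustration; basic_tokenizer is the function from the question. It shows that each fragment produced by split() contains no spaces, so re.split(" ", ...) returns it unchanged, and that the if w filter would only matter if you split on a single space directly.

```python
import re

def basic_tokenizer(sentence):
    # Original version from the question
    words = []
    for space_separated_fragment in sentence.strip().split():
        words.extend(re.split(" ", space_separated_fragment))
    return [w for w in words if w]

def short_tokenizer(sentence):
    # Shortened version (hypothetical name, used here only for comparison)
    return sentence.split()

sample = "  hello   world\tfoo\n"

# Step 1: split() with no argument splits on runs of any whitespace
# and drops leading/trailing whitespace, so no empty strings appear.
fragments = sample.split()  # ['hello', 'world', 'foo']

# Step 2: each fragment has no space left, so re.split(" ", ...) is a no-op
# that just wraps the fragment in a one-element list.
per_fragment = [re.split(" ", frag) for frag in fragments]  # [['hello'], ['world'], ['foo']]

# Step 3: the filter is only useful when splitting on a single space directly,
# because consecutive or trailing spaces then yield empty strings.
raw = re.split(" ", "a  b ")        # ['a', '', 'b', '']
filtered = [w for w in raw if w]    # ['a', 'b']

print(basic_tokenizer(sample) == short_tokenizer(sample))  # True
```

So the extra re.split() and the filter look like leftovers from a version that split on something other than whitespace; as written, they never change the result.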
