Question

I wrote a tokenize function that takes a string and splits it into a list of words.

My code:

import re

def tokenize(document):
    x = document.lower()
    return re.findall(r'\w+', x)

My output:

tokenize("Hi there. What's going on? first-class")
['hi', 'there', 'what', 's', 'going', 'on', 'first', 'class']

Desired output:

['hi', 'there', "what's", 'going', 'on', 'first-class']

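Basically, I want words with apostrophes and hyphenated words to remain single words in the list, double quotes included. How can I change my function to get the desired output?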

Answer 1

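\w+ matches one or more word characters (letters, digits, and the underscore); this does not include apostrophes or hyphens.

You need to use a character set here to tell Python exactly what you want to match: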
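A quick check makes this concrete. This is only a minimal sketch using the standard re module, and the sample string is made up for illustration:

```python
import re

# \w matches letters, digits, and the underscore, but not ' or -,
# so those two characters act as separators between matches.
sample = "what's first-class snake_case 42"  # made-up sample string
tokens = re.findall(r"\w+", sample)
print(tokens)  # ['what', 's', 'first', 'class', 'snake_case', '42']
```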

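>>> import re
>>> def tokenize(document):
...     # Lowercase first, then match runs of letters, apostrophes, and hyphens.
...     return re.findall("[A-Za-z'-]+", document.lower())
...
>>> tokenize("Hi there. What's going on? first-class")
['hi', 'there', "what's", 'going', 'on', 'first-class']
>>>

You'll also notice that the separate x = document.lower() line is gone: the call is now folded into the findall line. Lowercasing is still required to get the output you want. The A-Z range only lets the pattern match uppercase characters if you skip that step; it does not convert them to lowercase.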
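One limitation of a plain character set is that a lone hyphen, or a quote mark wrapped around a word, also counts as a token character. If that matters for your input, a possible variant is sketched below. It is not part of the answer above: the names TOKEN_RE and tokenize_strict are hypothetical, and the pattern assumes you also want digits and underscores kept, since it goes back to \w.

```python
import re

# Hypothetical variant: an apostrophe or hyphen is accepted only when it
# sits between word characters, so stray punctuation is never a token.
TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*")

def tokenize_strict(document):
    """Lowercase the text, then return words with inner ' and - kept."""
    return TOKEN_RE.findall(document.lower())

print(tokenize_strict("Hi there. What's going on? first-class"))
# ['hi', 'there', "what's", 'going', 'on', 'first-class']
print(tokenize_strict("well - 'quoted' word"))
# ['well', 'quoted', 'word']
```

The group is non-capturing, (?:...), so that re.findall returns whole matches rather than just the contents of the group.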
