Question

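I'd like to know how to implement a function get_words() that returns the words in a string as a list, with the punctuation removed.

The way I'd like to implement it is to replace everything that is not in string.ascii_letters with '' and return .split().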

def get_words(text):
    '''The function should take one argument which is a string'''
    return text.split()

For example:

>>> get_words('Hello world, my name is...James!')

returns:

['Hello', 'world', 'my', 'name', 'is', 'James']

Answer 1

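This has nothing to do with splitting or punctuation; you only care about the letters (and digits), so all you need is a regular expression: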

import re

def getWords(text):
    # Raw string so that \w reaches the regex engine unescaped
    return re.compile(r'\w+').findall(text)

Demo:

>>> re.compile(r'\w+').findall('Hello world, my name is...James the 2nd!')
['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']

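If you don't care about digits, replace \w with [A-Za-z] for letters only, or with [A-Za-z'] to include contractions and the like. There are probably better ways to build a letters-but-not-digits character class that also covers accented letters and similar characters.

I pretty much answered this question here: Split Strings with Multiple Delimiters?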
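As a rough sketch of the character-class variants mentioned above (both function names are made up for illustration): `[A-Za-z']+` keeps the apostrophe inside contractions, and `[^\W\d_]+` is one common way to say "word characters that are neither digits nor underscores", which for Python 3 `str` patterns also covers accented letters.

```python
import re

def words_ascii(text):
    # Hypothetical helper: ASCII letters plus apostrophes, so contractions survive.
    return re.findall(r"[A-Za-z']+", text)

def words_unicode_letters(text):
    # Hypothetical helper: any \w character except digits and underscore,
    # i.e. letters only, including accented ones in Python 3.
    return re.findall(r"[^\W\d_]+", text)

print(words_ascii("Don't stop, James the 2nd!"))
# ["Don't", 'stop', 'James', 'the', 'nd']
```

Note that with either pattern the digit in "2nd" is dropped and only "nd" remains, which may or may not be what you want.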

However, your question is actually underspecified: do you want 'this is: an example' to be split into:

['this', 'is', 'an', 'example']
or ['this', 'is', 'an', '', 'example']?

I'm assuming it is the first case.
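To make the distinction concrete, here is a small sketch using the standard `re` module. Where the empty strings land depends on the delimiter pattern; when splitting on single non-word characters they show up wherever two delimiters sit side by side.

```python
import re

text = 'this is: an example'

# Splitting on every single non-word character leaves an empty string
# wherever two delimiters are adjacent (here ':' followed by ' ').
split_result = re.split(r'\W', text)
print(split_result)    # ['this', 'is', '', 'an', 'example']

# Matching runs of word characters never produces empty strings.
findall_result = re.findall(r'\w+', text)
print(findall_result)  # ['this', 'is', 'an', 'example']
```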

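['this', 'is', 'an', 'example'] is what I want. Is there a way to do it without importing regex? If we could replace the non-ascii_letters with '' and then split the string into a list of words, would that work? - James Smith 2 minutes ago

A regular expression is the most elegant option, but yes, you could do it like this: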

def getWords(text):
    """
    Returns a list of words, where a word is defined as a
    maximally connected substring of letters or digits,
    as defined by str.isalnum()

    >>> getWords('Hello world, my name is... Élise!')  # works in python3
    ['Hello', 'world', 'my', 'name', 'is', 'Élise']
    """
    # Turn every non-alphanumeric character into a space, then split on whitespace
    return ''.join((c if c.isalnum() else ' ') for c in text).split()

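Or use .isalpha() instead if you only want letters.

Side note: you could also do the following, though it requires importing another standard library module: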
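A sketch of the ASCII-only variant the comment asks about, assuming a "word" means a run of characters from string.ascii_letters (the function name is illustrative). Note that replacing non-letters with '' instead of a space would glue 'is...James' together into 'isJames', so this sketch substitutes a space before splitting.

```python
import string

def get_words_ascii(text):
    # Illustrative name, not part of the original answer.
    letters = set(string.ascii_letters)
    # Replace every non-letter with a space so neighbouring words stay separated.
    cleaned = ''.join(c if c in letters else ' ' for c in text)
    return cleaned.split()

print(get_words_ascii('Hello world, my name is...James!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James']
```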

from itertools import groupby

# groupby is generally always overkill and makes for unreadable code
# ... but is fun

def getWords(text):
    # Group consecutive characters by whether they are alphanumeric,
    # and keep only the alphanumeric runs
    return [
        ''.join(chars)
        for isWord, chars in groupby(text, lambda c: c.isalnum())
        if isWord
    ]

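If this is homework, they are probably looking for something imperative, like a two-state finite state machine where the state is "was the last character a letter", and whenever the state changes from letter -> non-letter you output a word. Don't do that; it isn't a good way to program (though the abstraction is useful).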
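For reference only, a minimal sketch of what such a two-state machine might look like. The function name and the use of isalnum() to decide what counts as a "letter" are assumptions made for illustration.

```python
def get_words_fsm(text):
    # Illustrative two-state machine: in_word is True while the
    # previous character was alphanumeric.
    words = []
    current = []
    in_word = False
    for c in text:
        if c.isalnum():
            current.append(c)
            in_word = True
        elif in_word:
            # Transition letter -> non-letter: emit the finished word.
            words.append(''.join(current))
            current = []
            in_word = False
    if in_word:
        # The input ended in the middle of a word.
        words.append(''.join(current))
    return words

print(get_words_fsm('Hello world, my name is...James!'))
# ['Hello', 'world', 'my', 'name', 'is', 'James']
```

Compared with the one-liners, the explicit state and the end-of-input special case are where bugs tend to creep in.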

Answer 2

Try using re:

>>> import re
>>> [w for w in re.split(r'\W', 'Hello world, my name is...James!') if w]
['Hello', 'world', 'my', 'name', 'is', 'James']

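Although I'm not sure it will cover all of your use cases.

If you want to approach it the other way around, you can specify the characters that you want to appear in the result:

>>> import re, string
>>> re.findall('[%s]+' % string.ascii_letters, 'Hello world, my name is...James!')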
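Two cases where this may not do what you expect, as a quick sketch (the wrapper function name is illustrative): `\W` treats the apostrophe as a separator and the underscore as a word character.

```python
import re

def split_words(text):
    # Same idea as the one-liner above, wrapped in an illustrative function.
    return [w for w in re.split(r'\W', text) if w]

print(split_words("don't touch snake_case"))
# ['don', 't', 'touch', 'snake_case']
```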
['Hello', 'world', 'my', 'name', 'is', 'James']

Answer 3

You just need a tokenizer. Take a look at nltk, especially WordPunctTokenizer.

Extract words from a string, remove punctuation, and return a list with the separated words
