Question

Code:

import re

def main():
    a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
    b=word_find(a)
    print(b)

def word_find(sentence_list):
    word_list=[]
    word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")
    for i in range(len(sentence_list)):
        words=re.findall(word_reg,sentence_list[i])
        word_list.append(words)
    return word_list

main()

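I need to break each sentence into a list in which every word is a separate element.

The output currently looks like this:

[['the', 'mississippi', 'is', 'well', 'worth', 'reading'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways']]

I noticed that the last word of each sentence is missing: 'about' from the first and 'remarkable' from the second.

There is probably something wrong with my regular expression: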

word_reg=re.compile(r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?(.+?)[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]")

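But if I add a question mark to the last part of this regular expression, like this:

[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]?")

the result becomes a lot of single letters instead of words. What should I do?

Edit:

The reason I am not using string.split is that people can separate words in many different ways.

For example, when someone types a--b there is no space, but we still have to split it into 'a' and 'b'.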

Answer 1

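Using the right tool is always a winning strategy. In your case, the right tool is the NLTK word tokenizer, because it was designed to do exactly this: break sentences into words.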

import nltk
a = ['the mississippi is well worth reading about', 
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']
nltk.word_tokenize(a[1])
#['it', 'is', 'not', 'a', 'commonplace', 'river', ',', 'but', 
# 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']

Answer 2

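Let me suggest a simpler solution:

b = [re.split(r"[\W_]", s) for s in a]

The regular expression [\W_] matches any single non-word character (anything that is not a letter, digit, or underscore) plus the underscore, which is really all you need.

Your current regular expression requires each word to be followed by one of the characters in the list, but that list does not include "end of line", which can be matched with $.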
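To illustrate the point about $, here is a minimal sketch that reuses the delimiter class from the question. The names DELIM, optional_tail and delim_or_end are made up for this example. The first variant shows why making the trailing delimiter optional yields single letters (the lazy (.+?) is satisfied after one character), and the second accepts either a delimiter or the end of the string.

```python
import re

# Delimiter class copied verbatim from the question's pattern.
DELIM = r"[\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;]"

# Variant 1 (hypothetical name): trailing delimiter made optional.
# The lazy (.+?) can now stop after a single character.
optional_tail = re.compile(DELIM + r"?(.+?)" + DELIM + r"?")

# Variant 2 (hypothetical name): a word must be followed by a delimiter
# OR by the end of the string.
delim_or_end = re.compile(DELIM + r"?(.+?)(?:" + DELIM + r"|$)")

sentence = "the mississippi is well worth reading about"
letters = optional_tail.findall("ab cd")
words = delim_or_end.findall(sentence)

print(letters)  # ['a', 'b', 'c', 'd']
print(words)    # the last word, 'about', is now included
```

This only patches the original pattern. Longer runs of delimiters (three hyphens in a row, for instance) may still leak into a captured word, which is one reason the split-based approaches are easier to get right.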

Answer 3

You can match what you do not want and split on it:

>>> a=['the mississippi is well worth reading about', ' it is not a commonplace river, but on the contrary is in all ways remarkable']
>>> [re.split(r'\W+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['', 'it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]

(You may need to filter out the '' elements that re.split produces.)

Or use re.findall to capture what you do want and keep those elements:

>>> [re.findall(r'\b\w+', s) for s in a]
[['the', 'mississippi', 'is', 'well', 'worth', 'reading', 'about'], ['it', 'is', 'not', 'a', 'commonplace', 'river', 'but', 'on', 'the', 'contrary', 'is', 'in', 'all', 'ways', 'remarkable']]
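If you prefer the re.split version, the empty strings mentioned above can be dropped with a list comprehension. This is just one possible way to do the filtering, sketched on the same input:

```python
import re

a = ['the mississippi is well worth reading about',
     ' it is not a commonplace river, but on the contrary is in all ways remarkable']

# Split on runs of non-word characters, then keep only the non-empty pieces.
words = [[w for w in re.split(r'\W+', s) if w] for s in a]

print(words[1][:3])  # ['it', 'is', 'not']
```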

Answer 4

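You can use re.split and filter:

list(filter(None, re.split(r"[, \-!?:]+", a[0])))  # a[0] is one sentence; list() is needed on Python 3

Where I put the string "[, \-!?:]+", you should put whatever delimiters you need. filter just removes the empty strings that result from leading or trailing delimiters.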
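As a small illustration of what filter removes, here is a sketch on a made-up input string (text is not from the question) that has delimiters at both ends and includes the a--b case:

```python
import re

# Hypothetical sample input with leading/trailing spaces and a double hyphen.
text = " hello, world! a--b "

# re.split leaves an empty string wherever the text starts or ends with a delimiter.
pieces = re.split(r"[, \-!?:]+", text)

# filter(None, ...) drops falsy items, i.e. the empty strings.
words = list(filter(None, pieces))

print(pieces)  # ['', 'hello', 'world', 'a', 'b', '']
print(words)   # ['hello', 'world', 'a', 'b']
```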

Answer 5

Thanks, everyone.

Based on the other answers, the solution is to use re.split().

The top answer also features a superstar: NLTK.

import re

def word_find(sentence_list):
    # Split each sentence on any listed delimiter (two or more hyphens count as one).
    delimiters = r'\(|\)|,|\'|\"|:|\[|\]|\{|\}| |\-\-+|\t|;'
    return [re.split(delimiters, sentence) for sentence in sentence_list]
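One thing to watch with this approach is that adjacent delimiters (such as a comma followed by a space) produce empty strings in the result. A possible variation, sketched below with the hypothetical names SPLITTER and split_words, treats a run of delimiters as a single separator and drops empty pieces. Like the pattern above, it splits on two or more hyphens but leaves a single hyphen inside a word alone.

```python
import re

# Hypothetical helper pattern: one or more delimiters in a row act as one separator.
# Single characters go in the class; "--+" (two or more hyphens) is a separate branch.
SPLITTER = re.compile(r"(?:[()\[\]{},'\":; \t]|--+)+")

def split_words(sentence):
    # Drop the empty strings left by leading or trailing delimiters.
    return [w for w in SPLITTER.split(sentence) if w]

result = split_words("well-known a--b, (c)")
print(result)  # ['well-known', 'a', 'b', 'c']
```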

Last word of the sentence is missing when using a regular expression
