Question

Python 3

import re

text = "(CNN)Meaalofa Te'o -- Buemi. Canberra,"

def discard_punctuation(text):
    regex = r'\W*^\s^\d*-'
    return re.sub(regex, "", text)

def handle_text(text):
    text_without_punctuation = discard_punctuation(text)
    words_array = text_without_punctuation.split()
    pass  # Breakpoint

handle_text(text)

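From arbitrary text I would like to select only the words. Investigating the problem, I found that a hyphen is sometimes inside a word, and so is a digit (9-year-old, canyon-like).

My regex is regex = '\W*^\s^\d*-'.

The intent: match all non-alphanumeric characters; exclude all whitespace characters, which the split method needs; exclude all digits that are not followed by a hyphen.

I should also exclude hyphens that are not part of a word.

The result is: ['(CNN)Meaalofa', "Te'o", '--', 'Buemi.', 'Canberra,']

Documentation: https://docs.python.org/3/howto/regex.html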

\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

I thought that dots, commas, hyphens, parentheses, and apostrophes should all match \W.

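Questions:

1. I can't understand why the parentheses, the dot, the comma, and the apostrophe are still there.
2. I would have said that I removed the apostrophe. I need it, it is present in the result, and that is good, but I can't understand how that happened. Could you help me understand how the apostrophe ended up in the result?
3. The '--' is definitely a mistake. How can I deal with it?
4. Could you please suggest a better regex?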

Answer 1

Your definition of a "word" is rather vague; you could come up with:

import re

rx = re.compile(r'\s*(\S+)\s*')

string = """(CNN)Meaalofa Te'o -- Buemi. Canberra,"""
words = rx.findall(string)
print(words)
# ['(CNN)Meaalofa', "Te'o", '--', 'Buemi.', 'Canberra,']

See a demo on ideone.com and regex101.com. You may want to redefine what a "word" is.
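
As a possible follow-up to the questions above, here is a minimal sketch of why the original pattern removes nothing, plus one way to tighten the definition of a "word". Outside a character class, `^` is a start-of-string anchor rather than a negation, so `\W*^\s^\d*-` can never match and `re.sub` returns the text unchanged; that would explain why the parentheses, dot, comma, and apostrophe all survive. The `WORD` pattern and the `extract_words` helper are illustrative names, not part of the original code, and they assume a word is a run of ASCII letters or digits that may be joined by single internal hyphens or apostrophes.

```python
import re

text = "(CNN)Meaalofa Te'o -- Buemi. Canberra,"

# Outside [...], ^ anchors to the start of the string. Once \s has consumed
# a character, the second ^ cannot match (without re.MULTILINE), so the
# pattern as a whole never matches and sub() leaves the text as it was.
original = r'\W*^\s^\d*-'
unchanged = re.sub(original, "", text)
print(unchanged == text)  # True

# Assumed definition of a word (hypothetical, adjust as needed):
# ASCII letters/digits, optionally joined by single internal - or '.
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def extract_words(s):
    """Return every substring of s that matches the WORD pattern."""
    return WORD.findall(s)

print(extract_words(text))
# ['CNN', 'Meaalofa', "Te'o", 'Buemi', 'Canberra']
print(extract_words("a 9-year-old, canyon-like gap"))
# ['a', '9-year-old', 'canyon-like', 'gap']
```

With this approach the stand-alone `--` disappears because it contains no letter or digit, while `Te'o`, `9-year-old`, and `canyon-like` stay intact. Non-ASCII words would need a wider character class.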
