Question

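I define a word as a sequence of letters (a-z, A-Z) that may also contain an apostrophe. I want to split a sentence into words and remove the apostrophes from those words.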

I am currently doing the following to extract the words from a piece of text.

import re
text = "Don't ' thread \r\n on \nme ''\n "
words_iter = re.finditer(r'(\w|\')+', text)
words = (word.group(0).lower() for word in words_iter)
for i in words:
    print(i)

This gives me:

don't
'
thread
on
me
''

But what I want is:

dont
thread
on
me

How can I change my code to achieve this?

Note that there is no ' in the output I want.

I would also like words to be a generator.

Answer 1

This looks like a job for regex.

import re

text = "Don't ' thread \r\n on \nme ''\n "

# Define a function so as to make a generator
def get_words(text):

    # Find each block, separated by spaces
    for section in re.finditer(r"[^\s]+", text):

        # Get the text from the selection, lowercase it
        # (`.lower()` for Python 2 or if you hate people who use Unicode)
        section = section.group().casefold()

        # Filter so only letters are kept and yield
        section = "".join(char for char in section if char.isalpha())
        if section:
            yield section

list(get_words(text))
#>>> ['dont', 'thread', 'on', 'me']

Explanation of the regex:

[^    # An "inverse set" of characters, matches anything that isn't in the set
\s    # Any whitespace character
]+    # One or more times

So this matches any block of non-whitespace characters.
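As a quick check of the pattern on its own (a minimal sketch, not part of the original answer): `[^\s]+` behaves the same as the shorthand `\S+`, and both leave apostrophes inside the matched blocks, which is why the function above filters the characters afterwards.

```python
import re

text = "Don't ' thread \r\n on \nme ''\n "

# Both patterns match maximal runs of non-whitespace characters.
blocks = re.findall(r"[^\s]+", text)
shorthand = re.findall(r"\S+", text)

print(blocks)               # ["Don't", "'", 'thread', 'on', 'me', "''"]
print(blocks == shorthand)  # True
```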

Answer 2

words = (x.replace("'", '') for x in text.split())
result = tuple(x for x in words if x)

...which iterates over the split data only once.

If the dataset is large, use re.finditer instead of str.split() to avoid building the whole list of pieces in memory:

words = (m.group().replace("'", '') for m in re.finditer(r'[^\s]+', text))
result = tuple(x for x in words if x)

...although tuple() will still pull all of the results into memory.
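Since the question asks for a generator, one way to stay lazy is to skip the tuple() call and wrap the same idea in a generator function. This is only a sketch: the name `iter_words` is made up for illustration, the lowercasing is borrowed from the question's code, and only apostrophes are stripped (other punctuation is left in place).

```python
import re

def iter_words(text):
    """Lazily yield lowercased words with apostrophes removed."""
    for match in re.finditer(r"\S+", text):
        word = match.group().replace("'", "").lower()
        if word:  # skip blocks that consisted only of apostrophes
            yield word

text = "Don't ' thread \r\n on \nme ''\n "
words = iter_words(text)  # nothing has been scanned yet
print(next(words))        # dont
print(list(words))        # ['thread', 'on', 'me']
```

Each word is produced only when the consumer asks for it, so no intermediate list or tuple of all words is built.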

Answer 3

Using str.translate and re.finditer:

>>> text = "Don't ' thread \r\n on \nme ''\n "
>>> import re
>>> from string import punctuation
>>> tab = dict.fromkeys(map(ord, punctuation))
>>> def solve(text):
...     for m in re.finditer(r'\b(\S+)\b', text):
...         x = m.group(1).translate(tab).lower()
...         if x:
...             yield x
...
>>> list(solve(text))
['dont', 'thread', 'on', 'me']
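To see what the translation table does by itself, here is a small sketch assuming Python 3, where `str.translate` accepts a dict keyed by code points and deletes any character mapped to `None`. Note that it removes all ASCII punctuation, not only apostrophes.

```python
from string import punctuation

# Map every ASCII punctuation code point to None, meaning "delete it".
tab = dict.fromkeys(map(ord, punctuation))

print("Don't".translate(tab))                # Dont
print("well-known, really!".translate(tab))  # wellknown really
print("''".translate(tab) == "")             # True
```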

Timing comparison:

>>> strs = text * 1000
>>> %timeit list(solve(strs))
10 loops, best of 3: 11.1 ms per loop
>>> %timeit list(get_words(strs))
10 loops, best of 3: 36.7 ms per loop
>>> strs = text * 10000
>>> %timeit list(solve(strs))
1 loops, best of 3: 146 ms per loop
>>> %timeit list(get_words(strs))
1 loops, best of 3: 411 ms per loop

Answer 4

import string
allowed = string.ascii_letters + string.whitespace
tuple("".join(c for c in "strings don't have '" if c in allowed).split())

This turns a string into a tuple of words.
