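I define a word as a sequence of characters (from a to Z) that may also contain an apostrophe. I want to split a sentence into words, with the apostrophes removed from the words.
I am currently doing the following to get the words out of a piece of text.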
import re
text = "Don't ' thread \r\n on \nme ''\n "
words_iter = re.finditer(r'(\w|\')+', text)
words = (word.group(0).lower() for word in words_iter)
for i in words:
    print(i)
This gives me:
don't
'
thread
on
me
''
But what I want is:
dont
thread
on
me
How can I change the code to achieve this?
Note that there is no ' in my desired output.
I would also like words to be a generator.
Answer 0 (score: 3)
This looks like a job for regex.
import re
text = "Don't ' thread \r\n on \nme ''\n "
# Define a function so as to make a generator
def get_words(text):
    # Find each block, separated by whitespace
    for section in re.finditer(r"[^\s]+", text):
        # Get the text from the match, lowercase it
        # (`.lower()` for Python 2 or if you hate people who use Unicode)
        section = section.group().casefold()
        # Filter so only letters are kept and yield
        section = "".join(char for char in section if char.isalpha())
        if section:
            yield section
list(get_words(text))
#>>> ['dont', 'thread', 'on', 'me']
Explanation of the regex:
[^ # An "inverse set" of characters, matches anything that isn't in the set
\s # Any whitespace character
]+ # One or more times
So this matches any block of non-whitespace characters.
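As a quick check of that explanation, here is a minimal sketch that runs the pattern over the sample text from the question and compares it with the shorthand `\S+`. The variable names `inverse_set` and `shorthand` are made up for this illustration.

```python
import re

text = "Don't ' thread \r\n on \nme ''\n "

# [^\s]+ is an inverse set whose only member is \s;
# \S+ is the built-in shorthand for "one or more non-whitespace characters".
inverse_set = re.findall(r"[^\s]+", text)
shorthand = re.findall(r"\S+", text)

print(inverse_set)               # the raw blocks, apostrophes still included
print(inverse_set == shorthand)  # True: both patterns find the same blocks
```

The blocks still contain the stray apostrophes, which is why the answer filters each block down to letters before yielding it.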
Answer 1 (score: 1)
words = (x.replace("'", '') for x in text.split())
result = tuple(x for x in words if x)
...iterates over the split data only once.
If the data set is large, use re.finditer instead of str.split() to avoid reading the whole data set into memory:
words = (x.group().replace("'", '') for x in re.finditer(r'[^\s]+', text))
result = tuple(x for x in words if x)
...although tuple() will still read everything into memory.
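If the result should stay lazy, as the question asks, one option is to skip tuple() and wrap the same steps in a generator function. This is only a sketch: `iter_words` is a hypothetical name, and like the snippets above it does not lowercase the words.

```python
import re

def iter_words(text):
    """Hypothetical helper: lazily yield non-empty blocks with apostrophes removed."""
    for match in re.finditer(r"\S+", text):
        word = match.group().replace("'", "")
        if word:  # skip blocks that consisted only of apostrophes
            yield word

text = "Don't ' thread \r\n on \nme ''\n "
words = iter_words(text)

print(next(words))  # only the first block has been scanned at this point
print(list(words))  # consuming the rest materializes the remaining words
```

Nothing is held in memory beyond the current match until the caller decides to collect the words.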
Answer 2 (score: 0)
Use str.translate and re.finditer:
>>> text = "Don't ' thread \r\n on \nme ''\n "
>>> import re
>>> from string import punctuation
>>> tab = dict.fromkeys(map(ord, punctuation))
>>> def solve(text):
...     for m in re.finditer(r'\b(\S+)\b', text):
...         x = m.group(1).translate(tab).lower()
...         if x:
...             yield x
...
>>> list(solve(text))
['dont', 'thread', 'on', 'me']
>>> strs = text * 1000
>>> %timeit list(solve(strs))
10 loops, best of 3: 11.1 ms per loop
>>> %timeit list(get_words(strs))
10 loops, best of 3: 36.7 ms per loop
>>> strs = text * 10000
>>> %timeit list(solve(strs))
1 loops, best of 3: 146 ms per loop
>>> %timeit list(get_words(strs))
1 loops, best of 3: 411 ms per loop
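The %timeit lines above are IPython magics. Outside IPython, a comparable measurement could be sketched with the standard timeit module, as below. Both functions are restated from the answers so the snippet runs on its own (Python 3 assumed); the repeat count of 10 is an arbitrary choice, and the numbers will vary by machine, so none are shown.

```python
import re
import timeit
from string import punctuation

# Translation table mapping every ASCII punctuation character to None (deleted)
tab = dict.fromkeys(map(ord, punctuation))

def solve(text):
    for m in re.finditer(r'\b(\S+)\b', text):
        x = m.group(1).translate(tab).lower()
        if x:
            yield x

def get_words(text):
    for section in re.finditer(r"\S+", text):
        section = "".join(c for c in section.group().casefold() if c.isalpha())
        if section:
            yield section

strs = "Don't ' thread \r\n on \nme ''\n " * 1000

# Sanity check: on this input both approaches produce the same words
assert list(solve(strs)) == list(get_words(strs))

for func in (solve, get_words):
    seconds = timeit.timeit(lambda: list(func(strs)), number=10)
    print(func.__name__, seconds / 10, "s per call")
```

The two functions agree on this sample, but they are not equivalent in general: solve keeps digits and non-ASCII punctuation, while get_words keeps letters only.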
Answer 3 (score: 0)
import string
# ascii_letters works on Python 2 and 3 (string.letters is Python 2 only)
allowed = string.ascii_letters + string.whitespace
tuple("".join(c for c in "strings don't have '" if c in allowed).split())