Question

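My file is a single line containing about 100,000 words.

What is the fastest, most efficient way to extract the words whose length is greater than or equal to 4?

I considered using a regular expression, but I'm not sure whether that is the best approach.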

Answer 1

A list comprehension works well:

[word for word in line.split() if len(word) >= 4]
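As a rough, self-contained sketch of the same idea: the function name `long_words` and the `min_len` parameter are hypothetical names used only for illustration, and the sample string stands in for the line read from the file.

```python
def long_words(line, min_len=4):
    """Return the whitespace-separated words of `line` that have at least `min_len` characters."""
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces and a trailing newline do not produce empty words.
    return [word for word in line.split() if len(word) >= min_len]


sample = "the quick brown fox jumps over a lazy dog"
print(long_words(sample))  # ['quick', 'brown', 'jumps', 'over', 'lazy']
```

For a line of about 100,000 words, reading the whole line into memory and splitting it once is usually cheap enough that nothing more elaborate is needed.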

Answer 2

You can mmap the file and run re over it, for example:

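import mmap
import re

# Open in binary mode: on Python 3 an mmap is bytes-like, so the pattern must be bytes too.
with open('somefile', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    # \w{4,} matches runs of four or more word characters.
    words = re.findall(rb'\w{4,}', mf)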
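A runnable sketch of the mmap approach on Python 3 might look like the following. It assumes the file is non-empty (mapping an empty file raises an error) and that the words are ASCII; the function name `long_words_mmap`, the `min_len` parameter, and the temporary file are illustrative only.

```python
import mmap
import os
import re
import tempfile


def long_words_mmap(path, min_len=4):
    """Return runs of at least `min_len` word characters found in the file at `path`."""
    # A bytes pattern is required because the mmap object is bytes-like.
    pattern = re.compile(rb"\w{%d,}" % min_len)
    with open(path, "rb") as fin:
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mf:
            # With a bytes pattern, \w only matches ASCII letters, digits and "_",
            # so decoding each match as ASCII is safe.
            return [match.decode("ascii") for match in pattern.findall(mf)]


# Write a small sample file, scan it, then clean up.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"the quick brown fox jumps over a lazy dog")
    sample_path = tmp.name
try:
    print(long_words_mmap(sample_path))  # ['quick', 'brown', 'jumps', 'over', 'lazy']
finally:
    os.remove(sample_path)
```

Note that `\w{4,}` is not equivalent to splitting on whitespace: punctuation breaks a match, so `well-known` yields `well` and `known`, whereas `split()` keeps it as one word.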

Answer 3

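Are they separated by spaces? You could use the csv reader with the delimiter set to a space, then loop over the result and keep the entries with len() >= 4.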
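A minimal sketch of the csv idea, assuming the words are separated by single spaces and contain no quoting that should be interpreted. The function name `long_words_csv` is hypothetical, and an in-memory `io.StringIO` stands in for the opened file.

```python
import csv
import io


def long_words_csv(fileobj, min_len=4):
    """Split each line of `fileobj` on spaces with csv.reader and keep the long words."""
    # QUOTE_NONE stops the reader from treating '"' as a quote character,
    # which could otherwise merge several words into one field.
    reader = csv.reader(fileobj, delimiter=" ", quoting=csv.QUOTE_NONE)
    # Consecutive spaces produce empty fields; the length check drops them.
    return [word for row in reader for word in row if len(word) >= min_len]


sample = io.StringIO("the quick brown fox jumps over a lazy dog\n")
print(long_words_csv(sample))  # ['quick', 'brown', 'jumps', 'over', 'lazy']
```

With a real file, the csv documentation recommends opening it with `newline=''` before passing it to `csv.reader`.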

A better approach is to use the custom-newline file iterator from this feature request and set the newline to ' '. (You can follow the link to the code for fileLineIter().)

# fileLineIter() comes from the linked feature request; do_something() is a placeholder.
f = open(filename, 'rb')
for word in fileLineIter(f, ' ', ' '):
    if len(word) >= 4:
        do_something()
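The code for fileLineIter() is not reproduced here, so the following is only a stand-in that illustrates the same idea: read the file in fixed-size chunks and yield the pieces between separators, without loading the whole line at once. The name `iter_words`, its parameters, and the chunk size are assumptions for illustration, not the API from the feature request.

```python
import io


def iter_words(fileobj, sep=" ", chunk_size=64 * 1024):
    """Yield the `sep`-delimited pieces of a text file object, reading it in chunks."""
    tail = ""  # unfinished piece carried over from the previous chunk
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        pieces = (tail + chunk).split(sep)
        # The last piece may be cut off mid-word, so hold it back for the next chunk.
        tail = pieces.pop()
        for piece in pieces:
            yield piece
    if tail:
        yield tail


# A tiny chunk size forces words to straddle chunk boundaries.
sample = io.StringIO("the quick brown fox jumps over a lazy dog")
words = [word for word in iter_words(sample, chunk_size=5) if len(word) >= 4]
print(words)  # ['quick', 'brown', 'jumps', 'over', 'lazy']
```

Because this sketch splits only on `sep`, a trailing newline stays attached to the last word; strip it first if that matters for the length check.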