Question

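I have a large text file for testing that contains about 130 million words. To count the words in the file, I wrote the following code, which I call the "plain solution".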

#!/usr/bin/python3.7

with open('v_i_m_utf8.txt') as infile:
    words=0
    for line in infile:
        wordslist = line.split()
        words += len(wordslist)
print(words)

This is the result I get now:

tony@lenox:~$ time ./counting.py

 134721552

 real   0m29,391s

 user   0m28,907s

 sys    0m0,400s

 tony@lenox:~$

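So, is it possible to use some Python internals or tricks to make this string processing faster?

I only need the word count, computed as fast as possible within the Python runtime.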

Answer 1

Read the whole file at once instead of reading it line by line.

words = len(infile.read().split())
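If holding the whole file in memory is a concern, a chunked variant of the same idea is possible. The following is only a sketch under assumptions: the function name `count_words_chunked` and the `chunk_size` parameter are hypothetical, and the file is opened in binary mode, so `bytes.split()` splits on ASCII whitespace only. The count can therefore differ from `str.split()` if the text contains Unicode-only whitespace such as a no-break space. Whether it is actually faster on a given machine has to be measured.

```python
def count_words_chunked(path, chunk_size=1 << 20):
    """Count whitespace-separated words by reading fixed-size binary chunks."""
    total = 0
    in_word = False  # True if the previous chunk ended in the middle of a word
    with open(path, "rb") as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                return total
            total += len(chunk.split())
            # A word continuing from the previous chunk was counted twice.
            if in_word and not chunk[:1].isspace():
                total -= 1
            in_word = not chunk[-1:].isspace()
```

Splitting on raw bytes is safe for UTF-8 input because the bytes of a multi-byte sequence never collide with ASCII whitespace.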

Answer 2

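Does Cython count?

Timings on my computer:

OP's example took 6.5 s
George's version took 5.3 s
This Cython code took 0.65 s
A similar C version took 0.73 s (not sure why that is slower than Cython)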

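Using:

cdef extern from "ctype.h":
    int isspace(int x)

def cfunc(fd):
    # fd must be a file opened in binary mode, since buf is typed as bytes.
    cdef bytes buf
    cdef int tot = 0, prev = 0, cur
    cdef char c

    while True:
        buf = fd.read(8192)
        if not buf:
            return tot

        for c in buf:
            cur = isspace(c)
            # Count each transition from a non-space byte to a space byte,
            # i.e. the end of each word.
            if cur and not prev:
                tot += 1
            prev = cur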
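The Cython loop above increments its counter whenever a whitespace byte follows a non-whitespace byte, so it counts word ends. For cross-checking results on small inputs, the same state machine can be sketched in plain Python. This is a hypothetical reference helper (the names `count_word_starts` and `ASCII_WHITESPACE` are not from the answer), far slower than the Cython version, and it deliberately counts word starts instead, so that a final word without trailing whitespace is included and the result agrees with `len(data.split())` on bytes.

```python
import io

# The six bytes that C isspace() accepts in the default "C" locale.
ASCII_WHITESPACE = frozenset(b" \t\n\r\x0b\x0c")

def count_word_starts(stream, chunk_size=8192):
    """Count words in a binary stream by detecting space -> non-space transitions."""
    total = 0
    prev_is_space = True  # treat the start of the stream as whitespace
    while True:
        buf = stream.read(chunk_size)
        if not buf:
            return total
        for byte in buf:  # iterating over bytes yields ints in Python 3
            cur_is_space = byte in ASCII_WHITESPACE
            if prev_is_space and not cur_is_space:
                total += 1
            prev_is_space = cur_is_space

sample = b"  hello  world\nfoo"
print(count_word_starts(io.BytesIO(sample)) == len(sample.split()))
```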
