Question

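This is a homework assignment I'm working on. I have a working version of the code, but it currently takes about 1 hour to run on the files we've been given. I'll share samples of the files along with my code (and a high-level description), and then I could use some thoughts on why my code runs as slowly as it does. The first file below is the words file, for which I'm approximating the number of times each word (represented as a number) appears: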

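the_words.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
13
16
17
6
18
19
20
21
22
23
24
25
6
26
27
28
29
30
9
31
32
33
34
15
35
36
37
9
38
39
11
40
13
41
42

The second file contains the parameters for the 5 hash functions used in my script:

the_hashes.txt
3   1561
17  277
38  394
61  13
78  246

Here is my version of the code. At a high level, I (1) do my imports and set variables, (2) create a hash function, and (3) loop over the words in the_words.txt (each of which is an int, confusingly), hash each word with the 5 hash functions, and increment the value at the appropriate index of the C matrix by 1. My code: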
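For reference, the three steps described in the question amount to a count-min-sketch style update. The following is a minimal sketch of that procedure and not the original script: the modulus `P`, the bucket count `N_BUCKETS`, and the helper name `count_words` are assumptions chosen for illustration, while the `(a, b)` pairs are the ones listed in the_hashes.txt.

```python
import numpy as np

# Assumed constants, for illustration only; the assignment's real values are not shown.
P = 123457          # assumed modulus of the hash functions
N_BUCKETS = 10000   # assumed number of columns in C

def hash_fun(a, b, p, n_buckets, x):
    """Map integer x to a bucket index: ((a*x + b) % p) % n_buckets."""
    y = x % p
    hash_val = (a * y + b) % p
    return hash_val % n_buckets

def count_words(lines, hashes, p=P, n_buckets=N_BUCKETS):
    """Increment one counter per hash function for every word in `lines`."""
    C = np.zeros((len(hashes), n_buckets), dtype=np.int64)
    for line in lines:
        x = int(line)  # each "word" is stored as an integer id
        for i, (a, b) in enumerate(hashes):
            C[i, hash_fun(a, b, p, n_buckets, x)] += 1
    return C

# (a, b) pairs from the_hashes.txt
the_hashes = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]
# A few word ids in the same format as the_words.txt
words = ["1", "2", "6", "6", "9", "6"]
C = count_words(words, the_hashes)
print(C.sum(axis=1))  # every row has received one increment per word
```

Written this way, each word costs five Python-level function calls and five NumPy scalar updates, which is the kind of per-item overhead that adds up over 200M words.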

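However, for a file of 200M words, this currently takes too long. Is there any obvious reason why my code is running slowly? I know it can take a while to stream through 200M words, but I'd like to cut down the time it currently takes.

Thanks!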

Answer 1

If you can't load the data into memory, there are some parts you can inline and factor out:

# end, counter, the_hashes, my_p, cols and C come from the setup code in the question
my_range = range(0, end)  # build the range once; matters most on Python 2, see note below
with open("the_words.txt", "r") as file:
    for word in file:
        counter = counter + 1
        # factor this out: computed once per word instead of once per hash
        y = int(word) % my_p
        # loop over the 5 different pairs of (a, b) values for the hashes
        for i in my_range:
            my_a = the_hashes[i][0]
            my_b = the_hashes[i][1]

            # save a function call by inlining
            # my_output = hash_fun(my_a, my_b, my_p, cols, my_x)

            hash_val = (my_a*y + my_b) % my_p
            my_output = hash_val % cols
            C[i, my_output] += 1

        if counter % 10000 == 0:
            print(counter)

I would also look at the math in hash_val = ... to see whether some of the computation can be factored out.
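Before relying on the pre-reduced `y`, it is worth confirming that reducing the word modulo p first does not change the hash value. For integers, (a*(x % p) + b) % p equals (a*x + b) % p. The quick check below uses the pairs from the_hashes.txt; the modulus `P` and the function names are assumptions for illustration.

```python
import random

P = 123457  # assumed modulus, for illustration only
the_hashes = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]

def hash_direct(a, b, p, x):
    """Hash computed straight from x."""
    return (a * x + b) % p

def hash_prereduced(a, b, p, x):
    """Hash computed from y = x % p, which the real loop needs only once per word."""
    y = x % p
    return (a * y + b) % p

random.seed(0)
samples = [random.randrange(10**9) for _ in range(1000)]
mismatches = sum(
    hash_direct(a, b, P, x) != hash_prereduced(a, b, P, x)
    for x in samples
    for a, b in the_hashes
)
print(mismatches)  # 0
```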

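As for range(0, end): depending on which version of Python you are using, you may want to cache that call (see https://stackoverflow.com/a/40294986/1138710). I suspect Python 2, judging by your print statement.

Also, I suggest reading Python performance characteristics for some interesting ways to improve performance, or at least to better understand what your code is doing.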

The above is just guesswork. See How can you profile a script? to learn how to profile your code and find out where the bottleneck is.
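As a rough illustration of what profiling could look like here, the sketch below runs a small stand-in for the counting loop under the standard-library `cProfile` module and prints the most expensive entries. The function `process`, the synthetic word list, and the constants 123457 and 10000 are assumptions, not the code from the question.

```python
import cProfile
import io
import pstats

def hash_fun(a, b, p, n_buckets, x):
    return ((a * (x % p) + b) % p) % n_buckets

def process(words, hashes, p, n_buckets):
    """Stand-in for the counting loop, using plain lists as counters."""
    counts = [[0] * n_buckets for _ in hashes]
    for x in words:
        for i, (a, b) in enumerate(hashes):
            counts[i][hash_fun(a, b, p, n_buckets, x)] += 1
    return counts

the_hashes = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]
words = list(range(1, 20001))  # synthetic stand-in for the_words.txt

profiler = cProfile.Profile()
profiler.enable()
counts = process(words, the_hashes, 123457, 10000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # limit the report to the five most expensive entries
report = stream.getvalue()
print(report)
```

A whole script can also be profiled without editing it, for example with `python -m cProfile -s cumulative script.py`, where `script.py` stands for your own file.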

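My other guess, since you are using numpy, would be to rely on its matrix computation capabilities, which I expect to be better optimized. (a*y + b) % p looks like good vector math to me :)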
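A minimal sketch of that idea, assuming the words are read in chunks that fit in memory: broadcasting computes all five hashes for a whole chunk at once, and `np.bincount` tallies the bucket hits per hash row. The constants `P` and `N_BUCKETS`, the chunk size, and the helper names are assumptions for illustration.

```python
from itertools import islice

import numpy as np

P = 123457          # assumed modulus, for illustration only
N_BUCKETS = 10000   # assumed number of columns in C

def update_counts(C, chunk, a, b, p=P, n_buckets=N_BUCKETS):
    """Add one chunk of word ids to C, hashing with all (a, b) pairs at once."""
    y = chunk % p                                                # shape (n,)
    buckets = ((a[:, None] * y[None, :] + b[:, None]) % p) % n_buckets  # shape (k, n)
    for i in range(C.shape[0]):
        # bincount counts how many words of the chunk landed in each bucket
        C[i] += np.bincount(buckets[i], minlength=n_buckets)
    return C

def count_stream(lines, hashes, chunk_size=10**6, p=P, n_buckets=N_BUCKETS):
    """Consume an iterable of text lines chunk by chunk and return the count matrix."""
    params = np.asarray(hashes, dtype=np.int64)
    a, b = params[:, 0], params[:, 1]
    C = np.zeros((len(hashes), n_buckets), dtype=np.int64)
    lines = iter(lines)
    while True:
        chunk = np.array([int(s) for s in islice(lines, chunk_size)], dtype=np.int64)
        if chunk.size == 0:
            break
        update_counts(C, chunk, a, b, p, n_buckets)
    return C

the_hashes = [(3, 1561), (17, 277), (38, 394), (61, 13), (78, 246)]
words = ["1", "2", "6", "6", "9", "6", "13"]
C = count_stream(words, the_hashes, chunk_size=3)
print(C.sum(axis=1))
```

For the real file, the same function can be fed the open file object directly, e.g. `with open("the_words.txt") as f: C = count_stream(f, the_hashes)`. Parsing the lines into integers still happens in Python, so profiling is the way to see how much this actually saves.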

Speeding up a data stream count approximation algorithm in Python