Using Python

Time: 2018-06-21 19:41:08

Tags: python python-3.x file search profiling

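I am trying to find out which method of searching for a string in a large file is the most efficient in terms of memory and computation.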

The three approaches I can think of are:

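  1. Read the file line by line and stop when the string is found.

  2. The standard way for small files: string in open(filename).read()

  3. Use memory mapping, as suggested here. The answer also states that this should solve possible memory problems, because it searches the underlying file directly instead of reading the file into memory.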

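Based on this, I would expect the following: method 1 should be fast and memory efficient if the search string is encountered early, and slow otherwise; method 2 should be really slow and memory hungry; and method 3 should be very memory efficient and also fast if the search string is encountered early, and slow otherwise.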
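
The early-exit behaviour expected of method 1 can be illustrated on a small throwaway file. This is only a minimal sketch: the helper name first_match_line and the demo file contents are hypothetical and are not part of the test script.

```python
import os
import tempfile

def first_match_line(needle, filename):
    """Return the 1-based number of the first line containing needle, or None."""
    with open(filename, "r", encoding="utf-8") as infile:
        for lineno, line in enumerate(infile, start=1):
            if needle in line:
                return lineno  # stop reading as soon as a match is found
    return None

# Hypothetical demo file: one early marker, filler lines, one late marker.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("early\n")
    tmp.write("filler\n" * 1000)
    tmp.write("late\n")
    demo_path = tmp.name

early_line = first_match_line("early", demo_path)
late_line = first_match_line("late", demo_path)
missing_line = first_match_line("absent", demo_path)
os.remove(demo_path)

print(early_line, late_line, missing_line)  # prints: 1 1002 None
```

The number of lines read before returning grows with the position of the match, which is why this approach is expected to be cheap only for early matches.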

I tested this using this large (164MB) file from the USPTO and the following script:

import time
import mmap

@profile
def test1(string, filename):
    with open(filename, "r") as infile:
        for line in infile:
            if string in line:
                return True
    return False

@profile
def test2(string, filename):
    return string in open(filename).read()

@profile
def test3(string, filename):
    # from https://stackoverflow.com/a/4944929/1735215
    with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
        if s.find(bytes(string, "utf-8")) != -1:
            return True
    return False

def test_performance():
    teststrings = ['PATDOC', 'chargeable program.</PDAT></PTEXT>'] 
    testfile = 'pg020212.XML'

    for teststring in teststrings:
        t0 = time.time()
        print(test1(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

        t0 = time.time()
        print(test2(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

        t0 = time.time()
        print(test3(teststring, testfile))
        t1 = time.time()
        print(t1 - t0)

test_performance()

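The script tests, time-profiles, and memory-profiles the three methods, first for a string that is found early in the file, and then for another string that is found only at the end. This is what happens: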

$ python -m memory_profiler search_efficiency_test.py
True
0.0007505416870117188                 # Method 1, string early in the file
True
0.47335124015808105                   # Method 2, string early in the file
True
0.00037598609924316406                # Method 3, string early in the file
True
171.73131465911865                    # Method 1, string at the bottom of the file
True
0.5401151180267334                    # Method 2, string at the bottom of the file
True
0.15198254585266113                   # Method 3, string at the bottom of the file
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   32.652 MiB   65.289 MiB   @profile
     5                             def test1(string, filename):
     6   32.652 MiB    0.000 MiB       with open(filename, "r") as infile:
     7   32.652 MiB    0.000 MiB           for line in infile:
     8   32.652 MiB    0.000 MiB               if string in line:
     9   32.652 MiB    0.000 MiB                   return True
    10                                 return False


Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    12   32.652 MiB   65.289 MiB   @profile
    13                             def test2(string, filename):
    14   32.695 MiB    0.059 MiB       return string in open(filename).read()


Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    16   32.695 MiB   65.348 MiB   @profile
    17                             def test3(string, filename):
    18                                 # from https://stackoverflow.com/a/4944929/1735215
    19   32.695 MiB    0.000 MiB       with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
    20  193.449 MiB  160.754 MiB           if s.find(bytes(string, "utf-8")) != -1:
    21   32.695 MiB -160.754 MiB               return True
    22                                 return False

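So:

  • For the string found early in the file, method 2 takes longer than methods 1 and 3, but 0.5 seconds is not that long for a file of this size.

  • For the string encountered only at the end (or not at all), method 1 does take a very long time, while methods 2 and 3 are both very fast, with method 3 faster than method 2.

  • Methods 1 and 2 do not seem to use any substantial memory, while the supposedly more memory-efficient method 3 uses a large amount of memory.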
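
All timings above were taken while the script ran under memory_profiler. A line-by-line profiler does extra work for every executed line, so a Python loop over many lines may be slowed far more than a single read() or find() call. Whether that accounts for the gap seen here is an assumption, not something the output above confirms. A minimal sketch for timing a method in a plain run, without the profiler, might look like this (the names search_by_line and timed, and the small demo file, are hypothetical):

```python
import os
import tempfile
import time

def search_by_line(needle, filename):
    """Same idea as test1, but without the @profile decorator."""
    with open(filename, "r", encoding="utf-8") as infile:
        for line in infile:
            if needle in line:
                return True
    return False

def timed(func, *args):
    """Run func(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()  # monotonic clock intended for short intervals
    result = func(*args)
    return result, time.perf_counter() - start

# Hypothetical demo file with the search string on the last line.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("filler line of text\n" * 100_000)
    tmp.write("needle-at-end\n")
    demo_path = tmp.name

found, elapsed = timed(search_by_line, "needle-at-end", demo_path)
os.remove(demo_path)

print(found, f"{elapsed:.4f} s")
```

Running the real script once with plain python (with the @profile decorators removed) and once with python -m memory_profiler would show how much of the measured time belongs to the profiler.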

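What is going on here? Have I misunderstood how these methods work? Why does method 1 take so much longer than method 2? Why do methods 1 and 2 use less memory than method 3? Or is the memory profiler not to be trusted?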
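
One way to cross-check the memory figures is to record the peak of Python-level allocations with the standard-library tracemalloc module rather than a per-line reading. This is a minimal sketch under stated assumptions: the helper names and the demo file are hypothetical, and tracemalloc only tracks memory allocated through Python's own allocators, so pages mapped by mmap are not counted by it.

```python
import os
import tempfile
import tracemalloc

def search_by_line(needle, filename):
    with open(filename, "r", encoding="utf-8") as infile:
        for line in infile:
            if needle in line:
                return True
    return False

def search_whole_file(needle, filename):
    with open(filename, "r", encoding="utf-8") as infile:
        return needle in infile.read()

def peak_python_allocation(func, *args):
    """Run func(*args) and return (result, peak traced bytes during the call)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Hypothetical demo file of about 2 MB with the search string at the end.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("filler line of text\n" * 100_000)
    tmp.write("needle-at-end\n")
    demo_path = tmp.name

file_size = os.path.getsize(demo_path)
found_line, peak_line = peak_python_allocation(
    search_by_line, "needle-at-end", demo_path)
found_whole, peak_whole = peak_python_allocation(
    search_whole_file, "needle-at-end", demo_path)
os.remove(demo_path)

print(f"file size:         {file_size} bytes")
print(f"line by line peak: {peak_line} bytes")
print(f"read() peak:       {peak_whole} bytes")
```

A peak measurement captures the temporary string created by read() even though it is released before the line finishes. A reading taken only after each line completes could miss it, which may be why method 2 shows almost no increment in the profile above. That explanation is an assumption and is not confirmed by the output.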

Software: Linux kernel 4.15.15, Python 3.6.4, memory-profiler 0.52.0 (installed via pip).

Edit:

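Following @Barmar's suggestion, I tested each of the three methods separately (with the other two commented out each time). However, the results do not seem to change much. To make sure that the poor performance of method 1 is not caused by the file somehow being kept in memory when the script terminates, I ran them in a different order (test 2 first, then 1, then 3):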

$ python -m memory_profiler search_efficiency_test.py
True
1.2371280193328857
True
0.5623576641082764
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    12   32.578 MiB   32.578 MiB   @profile
    13                             def test2(string, filename):
    14   32.578 MiB    0.000 MiB       return string in open(filename, "r", encoding="utf8", errors='ignore').read()


$ python -m memory_profiler search_efficiency_test.py
True
0.000804901123046875
True
178.5931453704834
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
     4   32.582 MiB   32.582 MiB   @profile
     5                             def test1(string, filename):
     6   32.582 MiB    0.000 MiB       with open(filename, "r") as infile:
     7   32.582 MiB    0.000 MiB           for line in infile:
     8   32.582 MiB    0.000 MiB               if string in line:
     9   32.582 MiB    0.000 MiB                   return True
    10                                 return False


$ python -m memory_profiler search_efficiency_test.py
True
0.0006160736083984375
True
0.16133618354797363
Filename: /home/user/search_efficiency_test.py

Line #    Mem usage    Increment   Line Contents
================================================
    16   32.688 MiB   32.688 MiB   @profile
    17                             def test3(string, filename):
    18                                 # from https://stackoverflow.com/a/4944929/1735215
    19   32.688 MiB    0.000 MiB       with open(filename, 'rb', 0) as infile, mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as s:
    20  193.281 MiB  160.594 MiB           if s.find(bytes(string, "utf-8")) != -1:
    21   32.688 MiB -160.594 MiB               return True
    22                                 return False

0 answers:

No answers yet