Question

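I'm importing several parts of a database dump in text format into MySQL. The problem is that a great deal of uninteresting material comes before the interesting data. I wrote this loop to get to the data I need: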

def readloop(DBFILE):
    txtdb = open(DBFILE, 'r')

    sline = ""

    # loop till 1st "customernum:" is found
    while sline.startswith("customernum:  ") is False:
        sline = txtdb.readline()

    while sline.startswith("customernum:  "):
        data = []
        data.append(sline)
        sline = txtdb.readline()
        while sline.startswith("customernum:  ") is False:
            data.append(sline)
            sline = txtdb.readline()
            if len(sline) == 0:
                break
        customernum = getitem(data, "customernum:  ")
        street = getitem(data, "street:  ")
        country = getitem(data, "country:  ")
        zip = getitem(data, "zip:  ")

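The text file is pretty huge, so just looping until the first wanted entry takes a lot of time. Does anyone have an idea whether this could be done faster (or whether the whole way I've set this up is not the best idea)?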

Many thanks in advance!

Answer 1

Please do not write this code:

while condition is False:

Boolean conditions are boolean, for cryin' out loud, so they can be tested (or negated and tested) directly:

while not condition:

Your second while loop isn't written as "while condition is True:", so I'm curious why you felt the need to test "is False" in the first one.

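Pulling out the dis module, I thought I'd dissect this a little further. In my pyparsing experience, function calls are total performance killers, so it would be nice to avoid function calls if possible. Here is your original test: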

>>> import dis
>>> test = lambda t : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_GLOBAL              1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

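Two expensive things happen here: CALL_FUNCTION and LOAD_GLOBAL. You could cut out the LOAD_GLOBAL by defining a local name for False (this session is Python 2, where False is an ordinary global name; in Python 3 it is a keyword and this trick is a syntax error):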

>>> test = lambda t,False=False : t.startswith('customernum') is False
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 LOAD_FAST                1 (False)
             15 COMPARE_OP               8 (is)
             18 RETURN_VALUE

But what if we just drop the 'is' test completely?

>>> test = lambda t : not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_ATTR                0 (startswith)
              6 LOAD_CONST               0 ('customernum')
              9 CALL_FUNCTION            1
             12 UNARY_NOT
             13 RETURN_VALUE

We've collapsed the LOAD_xxx and COMPARE_OP into a single UNARY_NOT. "is False" certainly isn't helping the performance cause any.
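The bytecode comparison above can be checked on your own interpreter with timeit. This is a minimal sketch, not a benchmark result: the sample line and the two helper names are made up for illustration, and no timings are quoted here because they vary by Python version and machine.

```python
import timeit

LINE = "street:  1 Example Road\n"  # hypothetical sample line, not from the real dump


def with_is_false(line=LINE):
    # Original style: compare the bool returned by startswith() against False.
    return line.startswith("customernum") is False


def with_not(line=LINE):
    # Direct negation of the bool returned by startswith().
    return not line.startswith("customernum")


if __name__ == "__main__":
    # Time each variant; only the relative difference is of interest.
    for fn in (with_is_false, with_not):
        seconds = timeit.timeit(fn, number=200_000)
        print(fn.__name__, round(seconds, 4))
```

Both functions return the same value for every input, so any difference you measure comes only from how the test is written.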

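Now, what if we could do some gross elimination of a line without making any function calls at all? If the first character of the line is not a 'c', there is no way it will startswith('customernum'). Let's try that: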

>>> test = lambda t : t[0] != 'c' and not t.startswith('customernum')
>>> dis.dis(test)
  1           0 LOAD_FAST                0 (t)
              3 LOAD_CONST               0 (0)
              6 BINARY_SUBSCR
              7 LOAD_CONST               1 ('c')
             10 COMPARE_OP               3 (!=)
             13 JUMP_IF_FALSE           14 (to 30)
             16 POP_TOP
             17 LOAD_FAST                0 (t)
             20 LOAD_ATTR                0 (startswith)
             23 LOAD_CONST               2 ('customernum')
             26 CALL_FUNCTION            1
             29 UNARY_NOT
        >>   30 RETURN_VALUE

(Note that using [0] to get the first character of a string does not create a slice - this is in fact very fast.)

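Now, assuming there are not a large number of lines starting with 'c', the rough-cut filter can eliminate a line using only fairly fast instructions. In fact, by testing "t[0] != 'c'" instead of "not t[0] == 'c'", we save ourselves an extraneous UNARY_NOT instruction.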

So, using this learning about short-cut optimization, I suggest changing this code:

while sline.startswith("customernum:  ") is False:
    sline = txtdb.readline()

while sline.startswith("customernum:  "):
    ... do the rest of the customer data stuff...

To this:

for sline in txtdb:
    if (sline[0] == 'c' and
            sline.startswith("customernum:  ")):
        ... do the rest of the customer data stuff...

Note that I have also removed the .readline() function call, and iterate over the file using "for sline in txtdb".
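For reference, here is one way the for-loop version might be fleshed out so that it still groups lines into records. It is only a sketch under assumptions: the function name iter_customer_records is hypothetical, the sample data is invented, and because getitem is not shown in the question, each record is returned as a raw list of lines.

```python
import io

TAG = "customernum:  "


def iter_customer_records(lines, tag=TAG):
    """Yield one list of lines per record; lines before the first tag are skipped.

    `lines` is expected to be a file object (or similar), so every line holds
    at least a newline and indexing line[0] is safe.
    """
    record = None  # stays None until the first tag line has been seen
    for line in lines:
        # Cheap first-character check before paying for the startswith() call.
        if line[0] == tag[0] and line.startswith(tag):
            if record is not None:
                yield record
            record = [line]
        elif record is not None:
            record.append(line)
    if record is not None:
        yield record  # last record in the file


# Invented sample data standing in for the real dump.
sample = io.StringIO(
    "junk header\n"
    "country: not a record\n"
    "customernum:  1\n"
    "street:  A Road\n"
    "customernum:  2\n"
    "street:  B Road\n"
)
records = list(iter_customer_records(sample))
```

Each yielded list could then be passed to getitem exactly as in the original loop.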

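I realize Alex has provided a different body of code entirely for finding that first 'customernum' line, but I would try optimizing within the general bounds of your algorithm before pulling out the big but obscure block-reading guns.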

Answer 2

I guess you are writing this import script and it gets boring to wait while testing it, so the data stays the same the whole time.

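You can run the script once to detect the actual positions in the file you want to jump to, by printing txtdb.tell(). Write those down and replace the searching code with txtdb.seek( pos ). Basically, that's building an index for the file ;-)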
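A minimal sketch of that tell()/seek() idea follows. The helper names and the demo file are invented for illustration. It opens the file in binary mode and uses readline(), on the assumption that you are on Python 3, where tell() is disabled on a text file while it is being iterated with a for loop.

```python
import os
import tempfile

TAG = b"customernum:  "


def find_first_record_offset(path, tag=TAG):
    """One-off slow scan: byte offset of the first line starting with tag, or None."""
    with open(path, "rb") as f:
        while True:
            pos = f.tell()       # offset of the line about to be read
            line = f.readline()
            if not line:         # end of file, tag never seen
                return None
            if line.startswith(tag):
                return pos


def read_line_at(path, offset):
    """Later runs: jump straight to a saved offset and read from there."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline()


# Demo on a small throwaway file (contents invented for illustration).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"junk line 1\njunk line 2\ncustomernum:  42\nstreet:  A Road\n")
    demo_path = tmp.name

offset = find_first_record_offset(demo_path)
first_line = read_line_at(demo_path, offset)
os.remove(demo_path)
```

The saved offset is only valid as long as the dump file itself does not change.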

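Another, more conventional way would be to read the data in larger chunks, a few MB at a time, rather than just the few bytes of a single line.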

Answer 3

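The general idea for optimization is to proceed "by big blocks" (mostly ignoring line structure) to locate the first line of interest, then move on to by-line processing for the rest. It's somewhat finicky and error-prone (off-by-one errors and the like), so it really needs testing, but the general idea is as follows: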

import itertools

def readloop(DBFILE):
  txtdb=open(DBFILE, 'r')
  tag = "customernum:  "
  BIGBLOCK = 1024 * 1024
  # locate first occurrence of tag at line-start
  # (assumes the VERY FIRST line doesn't start that way,
  # else you need a special-case and slight refactoring)
  blob = ''
  while True:
    chunk = txtdb.read(BIGBLOCK)
    if not chunk:
      # end of file: tag not present at all -- warn about that, then
      return
    blob = blob + chunk
    where = blob.find('\n' + tag)
    if where != -1:  # found it!
      blob = blob[where+1:] + txtdb.readline()
      break
    blob = blob[-len(tag):]
  # now make a by-line iterator over the part of interest
  thelines = itertools.chain(blob.splitlines(1), txtdb)
  sline = next(thelines, '')
  while sline.startswith(tag):
    data = []
    data.append(sline)
    sline = next(thelines, '')
    while not sline.startswith(tag):
      data.append(sline)
      sline = next(thelines, '')
      if not sline:
        break
    customernum = getitem(data, "customernum:  ")
    street = getitem(data, "street:  ")
    country = getitem(data, "country:  ")
    zip = getitem(data, "zip:  ")

Here, I've tried to keep as much of your structure intact as feasible, making only minor enhancements beyond the "big idea" of this refactoring.

Answer 4

This might help: Python Performance Part 2: Parsing Large Strings for 'A Href' Hypertext

Answer 5

Tell us more about the file.

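Can you use file.seek to do a binary search? Seek to the halfway mark, read a few lines, determine whether you are before or after the part you need, and recurse. That will turn your O(n) search into O(log n).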
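Whether this works depends on the file. The sketch below assumes something the question does not confirm: that any single complete line can be classified as "still in the uninteresting part" or "already in the wanted part". The predicate in_wanted_part, both function names, and the demo data (junk lines starting with "#") are hypothetical.

```python
import os
import tempfile


def line_start_at_or_after(f, offset):
    """Return the offset of the first line that starts at or after `offset`."""
    if offset == 0:
        return 0
    f.seek(offset - 1)
    f.readline()  # consume the rest of the line containing byte offset-1
    return f.tell()


def find_section_start(path, in_wanted_part):
    """Binary-search byte offsets for the first line where in_wanted_part is true.

    Assumes every line before the wanted part fails the predicate and every
    line from there on passes it.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        lo, hi = 0, size
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(line_start_at_or_after(f, mid))
            line = f.readline()
            if not line or in_wanted_part(line):
                hi = mid        # answer is at or before mid
            else:
                lo = mid + 1    # still in the uninteresting part
        return line_start_at_or_after(f, lo)


# Demo: invented file with 1000 junk lines followed by two records.
junk = b"".join(b"# junk %d\n" % i for i in range(1000))
wanted = b"customernum:  1\nstreet:  A Road\ncustomernum:  2\nstreet:  B Road\n"
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(junk + wanted)
    demo_path = tmp.name

found = find_section_start(demo_path, lambda line: not line.startswith(b"#"))
os.remove(demo_path)
```

If the record lines in the real dump cannot be told apart from the preceding material one line at a time, this approach does not apply and a linear or block scan is still needed.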

How can I improve the speed of this readline loop in Python?
