Question

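I have a very large text file that I need to parse for some information. On each line I check for certain keywords (which I call "flags"). Once I find a "flag", I call the method below to collect the data that follows it (usually just a name or a number). This is the method I currently use, and it works: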

def findValue(string, flag):
    string = string.strip()
    startIndex = string.find(flag) + len(flag)
    index = startIndex
    char = string[index:index+1]
    while char != " " and index < len(string):
        index += 1
        char = string[index:index+1]
    endIndex = index
    return string[startIndex:endIndex]

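However, it would be much simpler to call split() with a space as the delimiter and take the next item in the list, instead of "grabbing" characters one at a time.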

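The log files I am parsing are quite large (about 1.5 million lines or more), so I would like to know how much of a performance hit calling split() on each line would be compared with my current method.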
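For reference, the split()-based alternative described in the question might look like the sketch below. The name `find_value_split` and the `None` return value are hypothetical. It assumes the flag appears as a whitespace-separated token of its own, directly followed by the value, which is not necessarily how the real log lines are laid out.

```python
def find_value_split(line, flag):
    """Return the token that follows `flag`, or None if there is none."""
    # split() with no argument splits on any run of whitespace,
    # not only on single spaces.
    tokens = line.split()
    try:
        position = tokens.index(flag)
    except ValueError:
        return None  # flag is not present as its own token
    if position + 1 < len(tokens):
        return tokens[position + 1]
    return None  # flag was the last token on the line


print(find_value_split("2014-07-01 INFO user: alice logged in", "user:"))
```

Unlike findValue, this version builds a list of every token on the line before it looks for the flag, which is the extra work the question is asking about.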

Answer 1

I ran some timing tests using the string 'oabsecaosbeoiabsoeib;asdnvzldkxbcoszievbzldkvn.zlisebv;iszdb;vibzdlkv8niandsailbsdlivbslidznclkxvnlidbvlzidbvlzidbvlkxnv', searching for '8', with 100000 iterations each:

Your method: 2.156 seconds

str.split: 0.151 seconds

Another test, more realistic: 'hello this is for stack overflow and i absolutely hate typing unecessary characters'

Your method: 0.317 seconds

str.split: 0.267 seconds

A final test, with the string above multiplied by 100:

Your method: 0.325 seconds

str.split: 7.376 seconds

So it depends on the input.

In your case, with very large strings, I would definitely use your function!
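The answer does not show its benchmark code, so the following is only a sketch of how such a comparison could be set up with the standard `timeit` module. The function names, the choice of "for" as the flag, and the exact split() variant are assumptions. Absolute timings will differ by machine and Python version.

```python
import timeit

LINE = ("hello this is for stack overflow and i absolutely hate "
        "typing unecessary characters")


def find_value_by_scan(string, flag):
    # Character-by-character scan, following the method in the question.
    string = string.strip()
    start = string.find(flag) + len(flag)
    index = start
    char = string[index:index + 1]
    while char != " " and index < len(string):
        index += 1
        char = string[index:index + 1]
    return string[start:index]


def find_value_by_split(string, flag):
    # Split the whole line on spaces, then take the token after the flag.
    # Assumes the flag is present and is not the last token.
    tokens = string.split(" ")
    return tokens[tokens.index(flag) + 1]


# The scan version starts reading right after the flag, so its flag
# includes the trailing space; both calls should find the same value.
assert find_value_by_scan(LINE, "for ") == find_value_by_split(LINE, "for")

candidates = [
    ("scan", lambda: find_value_by_scan(LINE, "for ")),
    ("split", lambda: find_value_by_split(LINE, "for")),
]
for name, func in candidates:
    seconds = timeit.timeit(func, number=100_000)
    print(f"{name}: {seconds:.3f} s")
```

Replacing `LINE` with `LINE * 100` gives a rough equivalent of the last test above: the scan stops at the first space after the flag, while split() still has to tokenize the whole string.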

Answer 2

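Python's split() function is almost certainly written in C, which means it will be faster than equivalent code written in Python. However, if you only call split() on a single line (not on all 1.5 million), the difference will not be significant.

But why use split() at all when you only need the next item in the list? This is probably the most efficient approach of all: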

def findValue(string, flag):
    startIndex = string.find(flag) + len(flag)
    endIndex = string.find(' ', startIndex)
    if endIndex == -1:
        return string[startIndex:]
    else:
        return string[startIndex:endIndex]
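One caveat with this approach: `str.find` returns -1 when the flag is absent, so the function above then slices from offset `len(flag) - 1` instead of signalling a miss. A hedged variant that reports the miss is sketched below. The name `find_value_or_none` and the `None` convention are assumptions, not part of the answer.

```python
def find_value_or_none(string, flag):
    """Return the text between `flag` and the next space, or None."""
    flag_index = string.find(flag)
    if flag_index == -1:
        return None  # flag does not occur in this line
    start = flag_index + len(flag)
    end = string.find(" ", start)
    # No space after the value means it runs to the end of the line.
    return string[start:] if end == -1 else string[start:end]


print(find_value_or_none("level=warn user=alice", "user="))
print(find_value_or_none("level=warn", "user="))
```

If every line has already been checked for the flag before the call, as in the question, the extra test is redundant and the original version is enough.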

Answer 3

Assuming you have a file object pointing to the file:

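current_item = ""
char = file.read(1)
while char:
    if char != " ":
        # Still inside an item: keep accumulating characters.
        current_item += char
    else:
        # A space ends the current item.
        do_something_about_the_item(current_item)
        current_item = ""
    # Read the next character; without this the loop never terminates.
    char = file.read(1)
if current_item:
    # Handle the last item, which is not followed by a space.
    do_something_about_the_item(current_item)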
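As a point of comparison, the same item-by-item processing can be sketched by iterating over the file one line at a time instead of one character at a time. This is an alternative technique, not the answer's method. The name `iter_items` is hypothetical, and `io.StringIO` stands in for a real file object. Note that it treats newlines and tabs as separators as well as spaces.

```python
import io


def iter_items(file_obj):
    # Iterating over a file object yields one line at a time, so the
    # whole file is never held in memory at once.
    for line in file_obj:
        for item in line.split():
            yield item


sample = io.StringIO("alpha beta\ngamma delta\n")
print(list(iter_items(sample)))
```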

Answer 4

You could try Python's regular expression tools in the re module, which are particularly well suited to parsing text files. Some examples: http://www.thegeekstuff.com/2014/07/python-regex-examples/
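A minimal sketch of what the regex approach could look like for this problem. The helper `make_finder` and the sample log line are hypothetical. The pattern assumes the value is the run of non-whitespace characters immediately after the flag.

```python
import re


def make_finder(flag):
    # re.escape keeps characters such as "." or "[" in the flag literal.
    # Compiling once avoids rebuilding the pattern for every line.
    pattern = re.compile(re.escape(flag) + r"(\S*)")

    def find_value(line):
        match = pattern.search(line)
        return match.group(1) if match else None

    return find_value


find_user = make_finder("user=")
print(find_user("2014-07-01 level=warn user=alice action=login"))
```

Whether this is faster than `str.find` on 1.5 million lines would need to be measured. Its main benefit is that more complex flag and value formats can be expressed in the pattern.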

Python: split() efficiency vs. the character-grabbing method
