Question

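I need to read lines from a text file, but the 'end of line' marker is not always \n or \x or a combination of them; it may be any combination of characters, such as 'xyz' or '|'. However, the 'end of line' marker is always the same within a file and is known for each type of file.

Since the text file may be huge, I have to keep performance and memory usage in mind. What would be the best solution? Today I use a combination of string.read(1000) and split(myendofline) or partition(myendofline), but I would like to know whether a more elegant and standard solution exists.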

Answer 1

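Obviously the simplest way is to read the whole thing and then call .split('|').

However, if that is undesirable because it requires reading the whole content into memory, you can read arbitrary chunks and perform the split on them. You could write a class that grabs another arbitrary chunk when the current one is exhausted, so the rest of your application does not need to know about it.

Here is the input, zen.txt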

The Zen of Python, by Tim Peters||Beautiful is better than ugly.|Explicit is better than implicit.|Simple is better than complex.|Complex is better than complicated.|Flat is better than nested.|Sparse is better than dense.|Readability counts.|Special cases aren't special enough to break the rules.|Although practicality beats purity.|Errors should never pass silently.|Unless explicitly silenced.|In the face of ambiguity, refuse the temptation to guess.|There should be one-- and preferably only one --obvious way to do it.|Although that way may not be obvious at first unless you're Dutch.|Now is better than never.|Although never is often better than *right* now.|If the implementation is hard to explain, it's a bad idea.|If the implementation is easy to explain, it may be a good idea.|Namespaces are one honking great idea -- let's do more of those!

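Here is my little test case, which works for me. It does not handle every corner case, nor is it particularly pretty, but it should get you started.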

class SpecialDelimiters(object):
    def __init__(self, filehandle, terminator, chunksize=10):
        self.file = filehandle
        self.terminator = terminator
        self.chunksize = chunksize
        self.chunk = ''
        self.lines = []
        self.done = False

    def __iter__(self):
        return self

    def next(self):
        if self.done:
            raise StopIteration
        try:
            return self.lines.pop(0)
        except IndexError:
            #The lines list is empty, so let's read some more!
            while True:
                #Looping so even if our chunksize is smaller than one line we get at least one chunk
                newchunk = self.file.read(self.chunksize)
                self.chunk += newchunk
                rawlines = self.chunk.split(self.terminator)
                if len(rawlines) > 1 or not newchunk:
                    #we want to keep going until we have at least one block
                    #or reached the end of the file
                    break
            self.lines.extend(rawlines[:-1])
            self.chunk = rawlines[-1]
            try:
                return self.lines.pop(0)
            except IndexError:
                #The end of the road, return last remaining stuff
                self.done = True
                return self.chunk               

zenfh = open('zen.txt', 'rb')
zenBreaker = SpecialDelimiters(zenfh, '|')
for line in zenBreaker:
    print line
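
For readers on Python 3, the same chunk-and-split idea can be sketched as a generator. This is only an illustration, not the answer's own code: the name iter_records, the default chunk size, and the io.StringIO sample input are assumptions made for the example.

```python
import io

def iter_records(fh, terminator, chunksize=65536):
    """Yield terminator-separated records from a text-mode file object.

    Hypothetical helper: name and signature are illustrative only.
    """
    buffer = ''
    while True:
        chunk = fh.read(chunksize)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(terminator)
        # The last piece may be an unfinished record or end in a partial
        # terminator, so keep it for the next round instead of yielding it.
        buffer = parts.pop()
        for record in parts:
            yield record
    # Whatever follows the final terminator (an empty string if there is none).
    yield buffer

sample = io.StringIO('Beautiful is better than ugly.|Explicit is better than implicit.|Readability counts.')
records = list(iter_records(sample, '|', chunksize=10))
for record in records:
    print(record)
```

Only one chunk plus the unfinished remainder is held in memory at a time, as long as records are short compared with the file.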

Answer 2

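Here is a generator function that acts as an iterator over a file, cutting lines according to an exotic newline that is the same throughout the file.

It reads the file in chunks of lenchunk characters and yields the lines found in each chunk, one chunk after another.

Since the newline in my example is 3 characters long (':;:'), a chunk may end in the middle of a newline: this generator function takes care of that possibility and manages to yield the correct lines.

If the newline were a single character, the function could be simplified. I wrote the function only for the most delicate case.

With this function, a file can be read one line at a time without reading the whole file into memory.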
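
To see why a newline cut by a chunk boundary needs care, here is a small Python 3 illustration. The sample text and the chunk size of 7 are made up so that the boundary falls inside a ':;:'; the buffering shown is a generic sketch, not the function from this answer.

```python
text = 'alpha:;:beta:;:gamma'
eol = ':;:'
lenchunk = 7  # chosen so that chunk boundaries fall inside a ':;:'

# Simulate reading the file chunk by chunk.
chunks = [text[i:i + lenchunk] for i in range(0, len(text), lenchunk)]

# Naive approach: split each chunk on its own. Newlines cut by a
# chunk boundary are never seen, so no line is separated correctly.
naive = [piece for chunk in chunks for piece in chunk.split(eol)]

# Buffered approach: prepend the unfinished tail of the previous chunk
# before splitting, and hold back the last (possibly incomplete) piece.
buffered = []
tail = ''
for chunk in chunks:
    pieces = (tail + chunk).split(eol)
    tail = pieces.pop()
    buffered.extend(pieces)
buffered.append(tail)

print(naive)
print(buffered)
```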

from random import randrange, choice

# this part is to create an example file with newline being :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename,eol,lenchunk,nl=0):
    # nl = 0 or 1 acts as 0 or 1 in splitlines()
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            chunk = tail + kept
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1:
                    yield chunk[x:y+NL] # here: NL
                else:
                    break
                x = y+L # here: L
            # carry over anything left after the last newline found, so that
            # a chunk containing no complete newline is not lost
            tail = chunk[x:] + newtail
            chunk = f.read(lenchunk)
        yield tail


# 100 is an arbitrary chunk size for this small example file
for line in liner('fofo.txt',':;:',100):
    print line

Here is the same code, with prints here and there so that the algorithm can be followed.

from random import randrange, choice

# this part is to create an example file with newline being :;:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,40)))
                for i in xrange(50))
with open('fofo.txt','wb') as g:
    g.write(ch)


# this generator function is an iterator for a file
# if nl receives an argument whose bool is True,
# the newlines :;: are returned in the lines
def liner(filename,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    with open(filename,'rb') as f:
        ch = f.read()
        the_end = '\n\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'+\
                  '\nend of the file=='+ch[-50:]+\
                  '\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n'
        f.seek(0,0)
        chunk = f.read(lenchunk)
        tail = ''
        while chunk:
            if (chunk[-1]==':' and chunk[-3:]!=':;:') or chunk[-2:]==':;':
                wr = [' ##########---------- cut newline cut ----------##########'+\
                      '\nchunk== '+chunk+\
                      '\n---------------------------------------------------']
            else:
                wr = ['chunk== '+chunk+\
                      '\n---------------------------------------------------']
            last = chunk.rfind(eol)
            if last==-1:
                kept = chunk
                newtail = ''
            else:
                kept = chunk[0:last+L]   # here: L
                newtail = chunk[last+L:] # here: L
            wr.append('\nkept== '+kept+\
                      '\n---------------------------------------------------'+\
                      '\nnewtail== '+newtail)
            chunk = tail + kept
            wr.append('\n---------------------------------------------------'+\
                      '\ntail + kept== '+chunk+\
                      '\n---------------------------------------------------')
            print ''.join(wr)
            x = y = 0
            while y+1:
                y = chunk.find(eol,x)
                if y+1:
                    yield chunk[x:y+NL] # here: NL
                else:
                    break
                x = y+L # here: L
            # carry over anything left after the last newline found, so that
            # a chunk containing no complete newline is not lost
            tail = chunk[x:] + newtail
            print '\n\n==================================================='
            chunk = f.read(lenchunk)
        yield tail
        print the_end


for line in liner('fofo.txt',':;:',1):
    print 'line== '+line

Edit

I compared the execution times of my code and chmullig's code.

With a 'fofo.txt' file of about 10 MB, created with

from random import randrange, choice

alphabet = 'abcdefghijklmnopqrstuvwxyz '
ch = ':;:'.join(''.join(choice(alphabet) for nc in xrange(randrange(0,60)))
                for i in xrange(324000))
with open('fofo.txt','wb') as g:
    g.write(ch)

and measuring the times with:

from time import clock

te = clock()
for line in liner('fofo.txt',':;:', 65536):
    pass
print clock()-te

fh = open('fofo.txt', 'rb')
zenBreaker = SpecialDelimiters(fh, ':;:', 65536)

te = clock()
for line in zenBreaker:
    pass
print clock()-te

I obtained the following minimum times over several trials:

my code: 0.7067 seconds

chmullig's code: 0.8373 seconds
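
The "minimum over several trials" procedure can be sketched in Python 3 roughly as follows. The helper best_of and the workload consume_split are hypothetical stand-ins, not the code that produced the figures above; time.perf_counter is used because time.clock was removed in Python 3.8.

```python
import time

def best_of(func, repeats=5):
    """Return the shortest wall-clock duration of `repeats` calls to func().

    Hypothetical helper for illustration; taking the minimum reduces the
    influence of other activity on the machine.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

def consume_split():
    # Stand-in workload: iterate over the records of an in-memory string.
    for _ in ('x:;:' * 100000).split(':;:'):
        pass

shortest = best_of(consume_split)
print('%.4f seconds' % shortest)
```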

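Edit 2

I changed my generator function: liner2() takes a file handle instead of a file name. That way the opening of the file can be left out of the time measurement, as it is when measuring chmullig's code.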

from time import clock

def liner2(fh,eol,lenchunk,nl=0):
    L = len(eol)
    NL = len(eol) if nl else 0
    chunk = fh.read(lenchunk)
    tail = ''
    while chunk:
        last = chunk.rfind(eol)
        if last==-1:
            kept = chunk
            newtail = ''
        else:
            kept = chunk[0:last+L]   # here: L
            newtail = chunk[last+L:] # here: L
        chunk = tail + kept
        x = y = 0
        while y+1:
            y = chunk.find(eol,x)
            if y+1:
                yield chunk[x:y+NL] # here: NL
            else:
                break
            x = y+L # here: L
        # carry over anything left after the last newline found, so that
        # a chunk containing no complete newline is not lost
        tail = chunk[x:] + newtail
        chunk = fh.read(lenchunk)
    yield tail

fh = open('fofo.txt', 'rb')
te = clock()
for line in liner2(fh,':;:', 65536):
    pass
print clock()-te

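The results, after numerous trials to find the minimum times, are:

with liner(): 0.7067 seconds

with liner2(): 0.7064 seconds

chmullig's code: 0.8373 seconds

In fact, opening the file accounts for an infinitesimal part of the total time.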

Answer 3

Given your constraints, it may be best to first convert the known unusual line endings to normal line endings, and then use the usual:

for line in file:
    ...
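
One way the conversion step might look in Python 3 is sketched below. The function name convert_terminators and its parameters are assumptions for the example, and the sketch assumes the records themselves contain no '\n'; otherwise the converted file would be ambiguous.

```python
import io

def convert_terminators(src, dst, terminator, chunksize=65536):
    """Copy src to dst, writing '\\n' in place of each terminator.

    Hypothetical helper; src and dst are text-mode file objects.
    """
    buffer = ''
    while True:
        chunk = src.read(chunksize)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(terminator)
        # Hold back the last piece: it may end in a partial terminator.
        buffer = parts.pop()
        for record in parts:
            dst.write(record + '\n')
    # Text after the final terminator is written without a trailing newline.
    dst.write(buffer)

src = io.StringIO('first:;:second:;:third')
dst = io.StringIO()
convert_terminators(src, dst, ':;:', chunksize=4)

dst.seek(0)
lines = [line.rstrip('\n') for line in dst]  # ordinary line iteration now works
print(lines)
```

The conversion costs one extra pass over the data, so it mainly pays off when the converted file is read more than once.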

Answer 4

TextFileData.split(EndOfLine_char) seems to be your solution. If it does not work fast enough, then you should consider using a lower-level programming language.
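
Spelled out in Python 3, the read-everything-then-split approach might look like the sketch below. The helper read_records and the throwaway demo file are assumptions for illustration, and the whole file must fit in memory.

```python
import os
import tempfile

def read_records(path, terminator):
    """Read the whole file and split it on the terminator (hypothetical helper)."""
    # newline='' turns off newline translation, so the text is split as stored.
    with open(path, 'r', newline='') as fh:
        return fh.read().split(terminator)

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False, newline='') as tmp:
    tmp.write('first|second|third')

records = read_records(tmp.name, '|')
os.remove(tmp.name)
print(records)
```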

Special end-of-line characters/strings in lines read from a text file with Python
