Python - How to automatically split headers/chapters into separate files

Posted: 2015-11-22 23:48:38

Tags: python file split header rewind

I'm converting text directly to epub, and I'm automatically splitting the HTML book file into separate header/chapter files. At the moment the code below partly works, but it only creates every other chapter file, so half of the header/chapter files are missing from the output. Here is the code:

def splitHeaderstoFiles(fpath):

    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:

        # format and split headers to files
        if '<h1' in line:

            #-----------format header file names and other stuff ------------#

            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:

                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp = outfp.write(line)

                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue

    infp.close()

The problem occurs in the second "for loop" at the bottom of the code, where I look for the next h1 tag in order to stop the split. I can't use seek() or tell() to rewind or step back one line so the program can find the next header/chapter on the next iteration. Apparently you can't use them in Python inside a for loop that is driving the file's implicit iterator/next calls; it just raises a 'can't do nonzero cur-relative seeks' error.

I also tried a while line != '' plus readline() combination in the code, which gives the same error as above.
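
For reference, a minimal snippet that reproduces that error; book.html here just stands in for any UTF-8 text file:

# any attempt to step backwards while the for loop is iterating the file fails
with open('book.html', 'rt', encoding='utf-8') as fp:
    for line in fp:
        fp.seek(-1, 1)   # io.UnsupportedOperation: can't do nonzero cur-relative seeks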

Does anyone know a simple way to split HTML headers/chapters of varying lengths into separate files in Python? Is there any special Python module (such as pickle) that could help simplify this task?

I'm using Python 3.4.

Thanks in advance for any solution to this problem...

3 Answers:

Answer 0 (score: 2):

I ran into a similar problem a while back; here is a simplified solution:

from itertools import count

chapter_number = count(1)
output_file = open('000-intro.html', 'wt')

with open('index.html', 'rt') as input_file:
    for line in input_file:
        if '<h1' in line:
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'wt')
        output_file.write(line)

output_file.close()

With this approach, everything leading up to the first h1 block is written to 000-intro.html, the first chapter is written to 001-chapter.html, and so on. Please modify it to taste.

The gist of the solution: once you encounter an h1 tag, close the previous output file and open a new one.

Answer 1 (score: 0):

You are looping over the input file twice, which is likely what is causing your problem:

for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:            
        ...
        for line in infp:
            ...

Both for loops pull lines from the same file object, so the inner loop consumes lines that the outer loop never sees, including the <h1> line that should start the next chapter.
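
A small self-contained illustration of the skipping behaviour (the four-chapter sample data is made up; a real file object behaves the same way):

import io

book = io.StringIO(
    '<h1>One</h1>\ntext\n'
    '<h1>Two</h1>\ntext\n'
    '<h1>Three</h1>\ntext\n'
    '<h1>Four</h1>\ntext\n'
)
for line in book:
    if '<h1' in line:
        print('outer loop found:', line.strip())   # prints One and Three only
        for line in book:                  # pulls from the same iterator
            if '<h1' in line:
                break                      # this <h1> is consumed, never seen outside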

You could try converting the for loops into while loops so that you are not iterating over the file in two places at once:

while infp:
    line = infp.readline()
    if not line:                  # readline() returns '' at end of file
        break
    if '<h1' in line:
        with open(...) as outfp:
            while infp:
                line = infp.readline()
                if not line or '<h1' in line:
                    break
                outfp.write(...)

Alternatively, you may want to use an HTML parser (e.g. BeautifulSoup). Then you could do something like what is described here: https://stackoverflow.com/a/8735688/65295
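
As a rough sketch of that route (it assumes the bs4 package is installed, an input file named index.html, and that the <h1> tags sit directly under <body>; adjust to the real book layout):

from bs4 import BeautifulSoup

with open('index.html', 'rt', encoding='utf-8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# group the body's top-level elements into chapters, starting a new group at every <h1>
chapters = []
current = []
for element in soup.body.children:
    if getattr(element, 'name', None) == 'h1' and current:
        chapters.append(current)
        current = []
    current.append(element)
if current:
    chapters.append(current)

# write each group out as its own file
for num, chapter in enumerate(chapters, 1):
    with open('{:03}-chapter.html'.format(num), 'wt', encoding='utf-8') as out:
        out.write(''.join(str(el) for el in chapter))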

Update from the comments: basically, read the whole file in at once so you can move back and forth through it as needed. This is unlikely to be a performance problem unless you have a very large file (or very little memory).

lines = infp.readlines() # read the entire file
i = 0
while i < len(lines): 
    if '<h1' in lines[i]:
        with open(...) as outfp:
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
        # line j has an <h1>, set i to j so we detect it at the
        # top of the next loop iteration.
        i = j
    else:
        i += 1

Answer 2 (score: 0):

I eventually found the answer to the above problem. A lot of the code below is just there to fetch the file headers. It also loads two parallel lists at the same time, one with the formatted file-name data (with extension) and one with the plain header-name data, so that I can use those lists to fill in the title data in the generated html files in a single while-loop pass. The code now runs well and is shown below.

def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []

    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:

        for line in infp:

            if '<h1' in line:
                # strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header

                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header

                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1

                # create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1

                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True

            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)

    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0

    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))

        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)

        # add the html tail
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write('  </body>' + '\n</html>')

        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1

    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)

    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)

    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
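
A hypothetical call, for illustration only ('mybook' and 'book.xhtml' are made-up names standing in for the directory and the combined xhtml file produced earlier in the pipeline):

book_dir = os.getcwd() + os.sep + 'mybook'
file_paths, chapter_titles = splitHeaderstoFiles(book_dir, 'book.xhtml')
for path, title in zip(file_paths, chapter_titles):
    print(title, '->', path)   # these lists are reused later for the epub opf and ncx files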

@Hai Vu and @Seth - thank you for your help.