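I am converting text directly to epub, and I am automatically splitting the HTML book file into separate header/chapter files. At the moment the code below partly works, but it only creates every other chapter file, so half of the header/chapter files are missing from the output. Here is the code: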
def splitHeaderstoFiles(fpath):
    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:
        # format and split headers to files
        if '<h1' in line:
            #-----------format header file names and other stuff ------------#
            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:
                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp.write(line)
                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue
    infp.close()
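The problem occurs in the second for loop near the bottom of the code, where I look for the next h1 tag in order to stop the split. I can't use seek() or tell() to rewind or step back one line so that the program can find the next header/chapter on the next iteration. Apparently you can't use them in Python inside a for loop over a file, because the loop implicitly calls next() on the file object. It just gives a 'can't do nonzero cur-relative seeks' error.
I also tried a while line != '' plus readline() combination in the code, which gave the same error as above.
Does anyone know a simple way to split HTML headers/chapters of varying length into separate files in Python? Are there any special Python modules (such as pickle) that could help make this task easier?
I am using Python 3.4.
Thanks in advance for any solutions to this problem...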
Answer 0 (score: 2)
I ran into a similar problem a while ago. Here is a simplified solution:
from itertools import count

chapter_number = count(1)
# text mode ('wt'), because the lines read below are str, not bytes
output_file = open('000-intro.html', 'wt')

with open('index.html', 'rt') as input_file:
    for line in input_file:
        if '<h1' in line:
            # a new chapter starts: close the current file, open the next one
            output_file.close()
            output_file = open('{:03}-chapter.html'.format(next(chapter_number)), 'wt')
        output_file.write(line)

output_file.close()
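With this approach, the block of text leading up to the first h1 tag is written to 000-intro.html, the first chapter is written to 001-chapter.html, and so on. Modify it to taste.
The solution is simple: on encountering an h1 tag, close the current output file and open a new one.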
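As a rough illustration of the same close-and-reopen idea in a form that is easy to test, the sketch below first groups the lines into chunks and then writes each chunk to its own file. The function names split_on_h1 and write_chunks, the sample lines, and the temporary output directory are assumptions made for this example; they are not part of the answer above.

```python
import os
import tempfile

def split_on_h1(lines):
    """Group lines into chunks; a new chunk starts at each line containing '<h1'."""
    chunks = [[]]                      # chunks[0] holds the intro before the first <h1
    for line in lines:
        if '<h1' in line:
            chunks.append([])          # start a new chapter
        chunks[-1].append(line)        # the header line itself opens its chapter
    return chunks

def write_chunks(chunks, out_dir):
    """Write each chunk to its own file and return the file names used."""
    names = []
    for number, chunk in enumerate(chunks):
        label = 'intro' if number == 0 else 'chapter'
        name = '{:03}-{}.html'.format(number, label)
        with open(os.path.join(out_dir, name), 'w', encoding='utf-8') as fp:
            fp.writelines(chunk)
        names.append(name)
    return names

# hypothetical sample input standing in for the lines of index.html
sample = ['<html>\n', '<h1>One</h1>\n', 'text\n', '<h1>Two</h1>\n', 'more\n']
chunks = split_on_h1(sample)
out_dir = tempfile.mkdtemp()
names = write_chunks(chunks, out_dir)
print(names)
```

Separating the grouping from the file writing is only a design choice for the sketch; for a very large book the streaming version, which never holds more than one line in memory, may be preferable.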
Answer 1 (score: 0)
You are looping over the input file twice, which is probably what causes your problem:
for line in infp:
    ...
    with open(path, 'wt', encoding=('utf-8')) as outfp:
        ...
        for line in infp:
            ...
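Both for loops pull lines from the same file object, so the file is being advanced from two places. The inner loop has already consumed the h1 line it stops on, and when the outer loop resumes it moves on to the line after it, so that chapter is never seen.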
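The effect of nesting two for loops over one file can be reproduced without any files. The following is a small sketch that uses a plain list iterator in place of the file object (a file object is its own iterator, so it behaves the same way in this respect); the function name headers_seen_by_outer_loop and the sample lines are made up for the demonstration.

```python
def headers_seen_by_outer_loop(lines):
    """Return the header lines that the outer loop actually gets to see."""
    it = iter(lines)              # one shared iterator, like an open file
    seen = []
    for line in it:               # outer loop
        if '<h1' in line:
            seen.append(line)
            for line in it:       # inner loop advances the same iterator
                if '<h1' in line:
                    break         # this header is consumed here, the outer loop never sees it
    return seen

# hypothetical sample: four chapters with one body line each
lines = ['<h1>One</h1>', 'a', '<h1>Two</h1>', 'b',
         '<h1>Three</h1>', 'c', '<h1>Four</h1>', 'd']
found = headers_seen_by_outer_loop(lines)
print(found)
```

Only the first and third headers are collected, which matches the symptom in the question of every other chapter file being missing.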
You could try converting the for loops into while loops with readline(), so that you control exactly when the next line is read:
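line = infp.readline()
while line:                                  # readline() returns '' at end of file
    if '<h1' in line:
        with open(...) as outfp:
            outfp.write(line)                # write the header line itself
            line = infp.readline()
            while line and '<h1' not in line:
                outfp.write(line)
                line = infp.readline()
        # line now holds the next <h1 line (or '' at end of file);
        # do not read again here, so the outer loop can handle it
    else:
        line = infp.readline()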
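Alternatively, you may want to use an HTML parser (e.g. BeautifulSoup). Then you could do something like what is described here: https://stackoverflow.com/a/8735688/65295.
Update from the comments: basically, read the whole file at once so that you can move back and forth freely as needed. This is unlikely to be a performance problem unless the file is very large (or you have very little memory).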
lines = infp.readlines()  # read the entire file
i = 0
while i < len(lines):
    if '<h1' in lines[i]:
        with open(...) as outfp:
            outfp.write(lines[i])            # write the header line itself
            j = i + 1
            while j < len(lines):
                if '<h1' in lines[j]:
                    break
                outfp.write(lines[j])
                j += 1
        # line j has an <h1> (or j is past the end), set i to j so we
        # detect it at the top of the next loop iteration.
        i = j
    else:
        i += 1
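Once all lines are in memory, the same split can also be expressed with list slicing instead of two index counters. This is only a sketch of that idea; the function name chapter_slices and the sample lines are assumptions for illustration. Like the loop above, it drops anything that appears before the first h1 line.

```python
def chapter_slices(lines):
    """Return one list of lines per chapter, each starting at an '<h1' line."""
    # indices of every line that starts a chapter
    starts = [i for i, line in enumerate(lines) if '<h1' in line]
    # each chapter ends where the next one starts; the last one ends at EOF
    ends = starts[1:] + [len(lines)]
    return [lines[s:e] for s, e in zip(starts, ends)]

# hypothetical sample standing in for infp.readlines()
sample = ['<html>\n', '<h1>A</h1>\n', 'x\n', 'y\n', '<h1>B</h1>\n', 'z\n']
chapters = chapter_slices(sample)
for chapter in chapters:
    print(chapter)
```

Each element of chapters can then be written to its own output file with writelines().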
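Answer 2 (score: 0)
I eventually found the answer to the problem above. The code below does a lot more than just get the file headers. It also fills two parallel lists at the same time, one with the formatted file names (with extension) and one with the plain header names, so that I can use these lists to fill in the titles of these html files in one go inside a while loop. The code now works well and is shown below.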
import os
import re
import shutil
from string import capwords

# addMainHeaders, logmsg27 and DEBUG_FLAG are defined elsewhere in my program
def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    file_path_names = []
    pure_header_names = []
    outfp = None
    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:
        for line in infp:
            if '<h1' in line:
                #strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header
                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header
                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1
                # close the previous chapter file before opening the next one
                if outfp is not None:
                    outfp.close()
                #create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1
                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True
            # add header bodytext
            elif write_bodytext == True:
                outfp.write(line)
    # close the last chapter file so it is flushed before being re-read
    if outfp is not None:
        outfp.close()
    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0
    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))
        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)
        # add the html tail
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write(' </body>' + '\n</html>')
        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1
    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)
    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)
    # returned list values are also used
    # later to create epub opf and ncx files
    return(file_path_names, pure_header_names)
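The file-naming step in the function above could also be factored into a small helper, which makes it easier to check on its own. The following is a minimal sketch under that assumption; the name chapter_filename and the sample header line are hypothetical, and str.format with a zero-padded width stands in for the explicit check for counts below 10.

```python
import re
from string import capwords

TAG_RE = re.compile(r'<.*?>')   # same tag-stripping pattern as in the function above

def chapter_filename(h1_line, number):
    """Return (file name, plain title) for one h1 line and its chapter number."""
    # strip the html tags, then capitalise each word; capwords also drops
    # the trailing newline because it splits on whitespace
    title = capwords(TAG_RE.sub('', h1_line))
    # zero-pad the number to two digits and replace spaces with underscores
    name = '{:02d}_{}.xhtml'.format(number, title.replace(' ', '_'))
    return name, title

name, title = chapter_filename('<h1>the first chapter</h1>\n', 3)
print(name, title)
```

The two return values correspond to the entries stored in file_path_names (without the directory part) and pure_header_names.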
@Hai Vu and @Seth - thank you for your help.