Cutting plain HTML files using RegEx

Time: 2016-07-11 15:22:45

Tags: python-3.x pycharm extract html-encode utf

I am using this code to extract parts of locally stored HTML files and save the shortened documents to .txt files.

import glob
import os
import re


def extractor():
    os.chdir(r"F:\Test")  # the directory containing your html
    for file in glob.iglob("*.html"):  # iterates over all files in the directory ending in .html
        with open(file, encoding="utf8") as f, open((file.rsplit(".", 1)[0]) + ".txt", "w", encoding="utf8") as out:
            contents = f.read()
            extract = re.compile(r'(Start).*?End', re.I | re.S)
            cut = extract.sub('', contents)
            if re.search(extract, contents) is not None:
                out.write(cut)
            out.close()
extractor()

It works for most of my files, but for a few of them I run into encoding problems and get:

Traceback (most recent call last):
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 16, in <module>
    extractor()
  File "C:/Users/6930p/PycharmProjects/untitled/Versuch/CutFile.py", line 14, in extractor
    out.write(cut)
  File "C:\Users\6930p\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 241205-241210: character maps to <undefined>

Does anyone know what the problem is? I thought that by using encoding="utf8" I would not have any encoding problems...

Any help is appreciated!

1 answer:

Answer 0: (score: 0)

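OK, this was a problem with encoding="utf8" after all. I had forgotten to pass encoding="utf8" when opening my newly created .txt file, so it was written with the system default codec (cp1252, as the traceback shows), which cannot represent some of the characters in the HTML. The code has been updated and works!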
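
For anyone who wants to reproduce the difference, here is a minimal sketch. The helper name can_write and the sample string are made up for illustration and are not part of the original script. It tries to write the same text to a temporary file once with cp1252 (the codec named in the traceback) and once with utf8.

```python
import os
import tempfile


def can_write(text, encoding):
    """Return True if `text` can be written to a file opened with `encoding`.

    Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        try:
            # The encoding given here applies to this file object only;
            # it is not inherited from the file the text was read from.
            with open(path, "w", encoding=encoding) as out:
                out.write(text)
        except UnicodeEncodeError:
            return False
        return True


# U+2192 (rightwards arrow) has no mapping in cp1252.
sample = "price: 5 \u2192 10"

print(can_write(sample, "cp1252"))
print(can_write(sample, "utf8"))
```

This should print False for cp1252 and True for utf8. Each open() call takes its own encoding argument, so reading with utf8 does not make the output file utf8 as well.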