UTF-8 error when trying to open a folder of HTML files

Time: 2019-05-03 17:27:46

Tags: python utf-8 beautifulsoup directory codec

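I am trying to open an existing local folder ("train") on my PC that contains a list of HTML files. I want to convert them to txt and put the resulting files in another folder called "traintxt". I wrote some code, but I keep getting a UnicodeDecodeError. How can I fix this? And if I can't, how else could I do the same thing?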

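import glob
import os
from bs4 import BeautifulSoup

dir_path = r"/Users/martinagalletti/Desktop/parte 2 data mining/train/student"
results_dir = r"/Users/martinagalletti/Desktop/parte 2 data mining/train/studenttxt"

# Make sure the output folder exists before writing into it.
os.makedirs(results_dir, exist_ok=True)

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    # The UnicodeDecodeError is raised while reading a file opened this way
    # whenever its bytes are not valid UTF-8.
    with open(file_name, encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file, 'html.parser')

    # Write one .txt per HTML file into results_dir (inside the loop, so every
    # file is converted rather than only the last one).
    base_name = os.path.splitext(os.path.basename(file_name))[0]
    results_file = os.path.join(results_dir, base_name + '.txt')
    with open(results_file, 'w', encoding='utf-8') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')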
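
One possible way around the error, sketched under the assumption that some of the HTML files are simply not UTF-8 encoded (for example, older pages saved as Windows-1252): read each file as raw bytes and try a short list of encodings in turn. The helper names `decode_with_fallback` and `read_html_text` and the encoding list are hypothetical, chosen for illustration; they do not come from the question.

```python
from pathlib import Path

# Assumed candidate encodings; adjust to whatever the files really use.
FALLBACK_ENCODINGS = ("utf-8", "cp1252")


def decode_with_fallback(raw, encodings=FALLBACK_ENCODINGS):
    """Return the bytes decoded with the first encoding that succeeds."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last step cannot fail.
    return raw.decode("latin-1")


def read_html_text(path):
    """Read a file as bytes and decode it without assuming it is UTF-8."""
    return decode_with_fallback(Path(path).read_bytes())
```

The string returned by `read_html_text(file_name)` could then be passed to `BeautifulSoup(text, 'html.parser')` in place of the open file handle. Another option that may be enough on its own is to open the file with `open(file_name, 'rb')` and hand the binary handle to BeautifulSoup, which then tries to detect the encoding itself. Note that the fallback only guarantees that decoding does not raise; if the guessed encoding is wrong, some characters may still come out garbled.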

0 answers:

No answers