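I am trying to open an existing local folder ("train") on my PC that contains a list of HTML files. I want to convert them to txt and put the resulting files in another folder named "traintxt". I wrote some code, but I keep getting a UnicodeDecodeError. How can I fix it? And if I can't, how could I do the same thing another way?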
import glob
import os.path
from bs4 import BeautifulSoup

dir_path = r"/Users/martinagalletti/Desktop/parte 2 data mining/train/student"
results_dir = r"/Users/martinagalletti/Desktop/parte 2 data mining/train/studenttxt"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    with open(file_name, encoding='utf-8') as html_file:
        soup = BeautifulSoup(html_file)
    results_file = os.path.splitext(file_name)[0] + '.txt'
    with open(results_file, 'w') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')
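
One possible workaround, sketched under the assumption that some of the HTML files are not actually UTF-8 (for example, they were saved as Windows-1252): read the raw bytes and try a short list of encodings before parsing. The names `read_html_text` and `convert_folder` and the choice of fallback encodings are hypothetical, not part of the original code. The sketch also writes each output file into the results folder, which the posted loop never does because `results_dir` is unused there.

```python
import glob
import os


def read_html_text(path, encodings=("utf-8", "cp1252")):
    """Return the file's text, trying each encoding in turn (hypothetical helper)."""
    with open(path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a character, so it never raises.
    return raw.decode("latin-1")


def convert_folder(src_dir, dst_dir, selector='font[color="#FF0000"]'):
    """Write the text of every element matching `selector` to dst_dir/<name>.txt."""
    # Imported here so read_html_text stays usable with the standard library only.
    from bs4 import BeautifulSoup

    os.makedirs(dst_dir, exist_ok=True)
    for file_name in glob.glob(os.path.join(src_dir, "*.html")):
        soup = BeautifulSoup(read_html_text(file_name), "html.parser")
        base = os.path.splitext(os.path.basename(file_name))[0]
        out_path = os.path.join(dst_dir, base + ".txt")
        with open(out_path, "w", encoding="utf-8") as outfile:
            for tag in soup.select(selector):
                outfile.write(tag.text + "\n")
```

With the variables from the question, this would be called as `convert_folder(dir_path, results_dir)`. Another option that may be enough on its own is to open each file in binary mode (`open(file_name, "rb")`) and pass the handle straight to `BeautifulSoup`, which then tries to detect the encoding itself.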