I am writing a single file that combines all the files in a folder, and I want that text file to be UTF-8 encoded. My code is as follows:
import os
import codecs
import re

def file_concatenation(path):
    # Write every cleaned line from every input file into one UTF-8 output file
    with codecs.open('C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt', 'w', encoding='utf8') as outfile:
        for root, dirs, files in os.walk(path):
            for dir_name in dirs:
                for fname in os.listdir(root+"/"+dir_name):
                    with open(root+"/"+dir_name+"/"+fname) as infile:
                        for line in infile:
                            # Replace every character that is not a-z or A-Z with a space
                            new_line = re.sub('[^a-zA-Z]', ' ', line)
                            # Collapse runs of whitespace into a single space
                            outfile.write(re.sub(r"\s\s+", " ", new_line.lstrip()))

file_concatenation('C:/Users/JAYASHREE/Documents/NLP/bbc-fulltext/bbc')
When I check the encoding of the output file with chardetect, it reports ASCII with a confidence of 1.0:
C:\Users\JAYASHREE>chardetect "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt"
C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt: ascii with confidence 1.0
Could someone please help me resolve this? Thanks.
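
A likely explanation, sketched here rather than confirmed against the actual corpus: the substitution `re.sub('[^a-zA-Z]', ' ', line)` replaces every character outside a-z and A-Z with a space, so only ASCII characters ever reach `outfile`. ASCII is a subset of UTF-8, so a file containing only ASCII characters is byte-for-byte identical whether it is written as ASCII or as UTF-8, and a detector such as chardetect reports the narrower of the two. The snippet below illustrates this with a made-up input line; `clean_line` and `sample` are hypothetical names introduced only for the demonstration.

```python
import re

def clean_line(line):
    # Same two substitutions as in the question's inner loop
    letters_only = re.sub('[^a-zA-Z]', ' ', line)
    return re.sub(r'\s\s+', ' ', letters_only.lstrip())

# Hypothetical input line containing non-ASCII characters:
# \u00e9 is e-acute, \u2013 is an en dash, \u00a3 is the pound sign
sample = 'Caf\u00e9 prices rose 5% \u2013 \u00a33.20 today\n'
cleaned = clean_line(sample)

# After cleaning, encoding as UTF-8 and as ASCII yields the same bytes
utf8_bytes = cleaned.encode('utf-8')
ascii_bytes = cleaned.encode('ascii')
same_bytes = utf8_bytes == ascii_bytes

# The untouched line needs multi-byte sequences in UTF-8,
# so its byte length exceeds its character count
original_utf8 = sample.encode('utf-8')
has_multibyte = len(original_utf8) > len(sample)

print(repr(cleaned))
print(same_bytes)
print(has_multibyte)
```

If this is what is happening, the output file is already valid UTF-8 and nothing is wrong with the `codecs.open` call. The file would only be detected as something other than ASCII if the character class in the first `re.sub` were widened to keep non-ASCII letters, which is a design decision about the corpus rather than an encoding fix.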