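I have a folder containing 5 subfolders, each holding 50 email documents on one particular topic (so there are 5 topics/classes in total).
The task: train two classifiers, a decision tree and an SVC (with a linear kernel), and report the micro-averaged and macro-averaged F1 scores from 10-fold cross-validation. You may need to preprocess the data, prune the decision tree, and find a suitable value of C for the SVC.
Can you provide me with a table containing the micro-averaged and macro-averaged F1 scores?
I tried putting the emails from each folder into a single txt file, but when I run the decision tree it still does not let me do so.
I cannot get any results.
Should I put the files from all folders into one text file?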
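import glob

# Assumed from the pattern of the blocks below: the first folder is "cwx"
path = ("C:/Users/*******/DS Assign/toclassify/cwx/*")
files = glob.glob(path)
with open("C:/Users/*******/DS Assign/toclassify/cwx.txt", "w") as outfile:
    for f in files:
        with open(f) as infile:
            for line in infile:
                outfile.write(line)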
path = ("C:/Users/*******/DS Assign/toclassify/ra/*")
files = glob.glob(path)
#print(files)
with open("C:/Users/*******/DS Assign/toclassify/ra.txt", "w") as outfile:
    for f in files:
        with open(f) as infile:
            for line in infile:
                outfile.write(line)
path = ("C:/Users/*******/DS Assign/toclassify/rsh/*")
files = glob.glob(path)
#print(files)
with open("C:/Users/*******/DS Assign/toclassify/rsh.txt", "w") as outfile:
    for f in files:
        with open(f) as infile:
            for line in infile:
                outfile.write(line)
path = ("C:/Users/*******/DS Assign/toclassify/src/*")
files = glob.glob(path)
#print(files)
with open("C:/Users/*******/DS Assign/toclassify/src.txt", "w") as outfile:
    for f in files:
        with open(f) as infile:
            for line in infile:
                outfile.write(line)
path = ("C:/Users/*******/DS Assign/toclassify/tpm/*")
files = glob.glob(path)
#print(files)
Answer 0 (score: 0)
import os
import pandas as pd

data_dir = os.path.join('.', 'data')
data_ids = []
data_txt = []

# Create a helper function to read the data from a particular folder and file
def get_data(file_name, folder_dir):
    file_path = os.path.join(folder_dir, file_name)
    # Use a context manager so the file handle is closed after reading
    with open(file_path, 'r') as infile:
        return infile.read()

# Loop through each folder in the data directory
for folder in os.listdir(data_dir):
    # Create the folder directory from the data directory
    folder_dir = os.path.join(data_dir, folder)
    # Store the IDs of each file in the particular folder directory into a list
    data_ids += os.listdir(folder_dir)
    # Using list comprehension to create a list of the text contained in each file
    # for a particular ID in the folder directory
    data_txt += [get_data(data_id, folder_dir) for data_id in os.listdir(folder_dir)]

# Store into a Pandas dataframe for easy integration into modelling packages
df = pd.DataFrame({
    'id': data_ids,
    'text': data_txt
})
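
The dataframe above holds the text but no class label, and a classifier needs one sample per email together with its topic, so merging a folder into a single txt file throws that structure away. Below is a minimal sketch of one way the rest of the task could look with scikit-learn, assuming the folder name is the label. The names `load_emails` and `evaluate`, the TF-IDF preprocessing, and the candidate values for `C` and `ccp_alpha` (cost-complexity pruning) are illustrative assumptions, not part of the original assignment. Note that the F1 scores here are computed once over the pooled out-of-fold predictions rather than averaged per fold.

```python
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def load_emails(data_dir):
    """Return (texts, labels): one entry per email, labelled by its folder name."""
    texts, labels = [], []
    for folder in sorted(Path(data_dir).iterdir()):
        if not folder.is_dir():
            continue
        for path in sorted(folder.iterdir()):
            if path.is_file():
                # errors="ignore" skips bytes that are not valid UTF-8
                texts.append(path.read_text(encoding="utf-8", errors="ignore"))
                labels.append(folder.name)
    return texts, labels


def evaluate(texts, labels, n_splits=10, seed=0):
    """Cross-validate both classifiers; return one row of F1 scores per model."""
    # Hypothetical search grids: adjust them to the actual data.
    candidates = {
        "Decision tree": (DecisionTreeClassifier(random_state=seed),
                          {"clf__ccp_alpha": [0.0, 0.001, 0.01]}),
        "SVC (linear kernel)": (SVC(kernel="linear"),
                                {"clf__C": [0.01, 0.1, 1, 10]}),
    }
    outer_cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for name, (clf, grid) in candidates.items():
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
            ("clf", clf),
        ])
        # Inner 5-fold search picks the hyperparameter on each training split only
        search = GridSearchCV(pipe, grid, scoring="f1_macro", cv=5)
        predicted = cross_val_predict(search, texts, labels, cv=outer_cv)
        rows.append({
            "classifier": name,
            "micro_f1": f1_score(labels, predicted, average="micro"),
            "macro_f1": f1_score(labels, predicted, average="macro"),
        })
    return rows
```

Calling `evaluate(*load_emails(data_dir))` returns a list of dicts that can be passed to `pd.DataFrame` to get the requested table. Tuning inside `GridSearchCV` keeps the choice of `C` and `ccp_alpha` from seeing the held-out fold.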