I am using the code from repository 1 below, which does automated essay scoring on the English ASAP dataset. I want to score the Portuguese essays from repository 2 below, so the only method I changed in the code is load_data, which now looks like this:
import os

import pandas as pd
import unidecode
from sklearn.model_selection import train_test_split


def load_data(dataset_directory, train_size=0.8, validation_size=0.2):
    """
    Loads the essays from the XML corpus and splits them into train,
    validation and test sets.
    """
    essays_dct = {"essay_id": [], "essay_set": [], "essay": [], "domain1_score": []}
    id_ = 0
    essay_set = 0  # the Portuguese corpus has no prompt sets, so every essay gets set 0
    for dirpath, dirnames, filenames in os.walk(dataset_directory):
        for f in filenames:
            if f.endswith(".xml") and f != "prompt.xml":
                full_name = os.path.join(dirpath, f)
                doc = Document(full_name)  # XML essay reader from repository 2
                doc.read()
                txt_ = ""
                if doc.get_generalcomment() is not None:
                    txt_ += doc.get_generalcomment()
                if doc.get_specificaspects() is not None:
                    txt_ += doc.get_specificaspects()
                # lower-case and strip accents/diacritics before storing the text
                txt_ = unidecode.unidecode(txt_.lower())
                essays_dct["essay_id"].append(id_)
                essays_dct["essay_set"].append(essay_set)
                essays_dct["essay"].append(txt_)
                # grades use a decimal comma (e.g. "8,5"), so convert before parsing
                essays_dct["domain1_score"].append(float(doc.get_finalgrade().replace(",", ".")))
                id_ += 1
    essays = pd.DataFrame.from_dict(essays_dct)
    # hold out the test set first, then carve the validation set out of the rest
    essays_training, essays_test = train_test_split(essays, train_size=train_size, random_state=0)
    essays_train, essays_cv = train_test_split(essays_training, test_size=validation_size)
    return essays_train, essays_cv, essays_test
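
For reference, this is roughly how the method gets called; DATASET_DIR here is a placeholder for wherever the corpus from repository 2 is unpacked:

DATASET_DIR = "path/to/portuguese/corpus"  # placeholder path
train, cv, test = load_data(DATASET_DIR)
print(len(train), len(cv), len(test))
# the grades are parsed as floats, so the label range/scale is worth a look
print(train["domain1_score"].describe())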
I use the pre-trained GloVe embeddings from 3 below. However, no matter which architecture I try (LSTM, CNN, Bi-LSTM), I cannot get a kappa above 0.1, which is extremely low. What should I do?
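
In case it matters, kappa here means the quadratic weighted kappa used for ASAP. Below is a minimal sketch of how it can be computed with scikit-learn's cohen_kappa_score (my own illustration, not the repository's evaluation code); it expects integer labels, so the float grades parsed above have to be mapped to integer bins first, and plain rounding is only a placeholder choice:

import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk(y_true, y_pred):
    """Quadratic weighted kappa over integer score bins."""
    # round continuous grades (e.g. 8.5) to the nearest integer bin;
    # a placeholder -- the right granularity depends on the grade scale
    y_true = np.rint(np.asarray(y_true)).astype(int)
    y_pred = np.rint(np.asarray(y_pred)).astype(int)
    return cohen_kappa_score(y_true, y_pred, weights="quadratic")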