LSTM performs very poorly when I switch to a non-English dataset

Time: 2019-11-20 16:42:48

Tags: keras nlp regression lstm

I am using the code from the following repository 1, which does automated essay scoring on the English ASAP dataset.

I intend to score Portuguese-language essays from the following repository 2. The only method I changed in the code is load_data, which now looks like this:

# Imports this method relies on; `Document` is the XML corpus reader
# that comes with the repository 2 dataset and is not imported here.
import os

import pandas as pd
import unidecode
from sklearn.model_selection import train_test_split


def load_data(dataset_directory, train_size=0.8, validation_size=0.2):
    """
    Loads the essays from the XML corpus and divides them into
    train, validation and test sets.
    """
    essays_dct = {"essay_id": [], "essay_set": [], "essay": [], "domain1_score": []}
    id_ = 0
    essay_set = 0
    for (dirpath, dirnames, filenames) in os.walk(dataset_directory):
        for f in filenames:
            if f.endswith(".xml") and f != "prompt.xml":
                full_name = os.path.join(dirpath, f)
                doc = Document(full_name)
                doc.read()

                # Concatenate whichever grader-comment fields are present.
                txt_ = ""
                if doc.get_generalcomment() is not None:
                    txt_ += doc.get_generalcomment()
                if doc.get_specificaspects() is not None:
                    txt_ += doc.get_specificaspects()

                # Lowercase and strip accents from the Portuguese text.
                txt_ = unidecode.unidecode(txt_.lower())

                essays_dct["essay_id"].append(id_)
                essays_dct["essay_set"].append(essay_set)
                essays_dct["essay"].append(txt_)
                # Grades use a decimal comma, e.g. "7,5" -> 7.5.
                essays_dct["domain1_score"].append(float(doc.get_finalgrade().replace(",", ".")))

                id_ += 1

    essays = pd.DataFrame.from_dict(essays_dct)
    essays_training, essays_test = train_test_split(essays, train_size=train_size, random_state=0)
    essays_train, essays_cv = train_test_split(essays_training, test_size=validation_size)

    return essays_train, essays_cv, essays_test
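
For reference, the method is then called like this; the corpus path is illustrative, not from the repositories above:

# Hypothetical corpus location; point this at the unpacked repository 2 dataset.
essays_train, essays_cv, essays_test = load_data("data/portuguese-essays/")
print(len(essays_train), len(essays_cv), len(essays_test))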

I used pre-trained GloVe embeddings from the following source 3. However, no matter which architecture I try (LSTM, CNN, Bi-LSTM), I cannot get a kappa above 0.1, which is very low. What should I do?
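
For context, this is a minimal sketch of how such pre-trained vectors are typically wired into a Keras Embedding layer; the file name glove_s100.txt, the dimension, and the length cap are illustrative assumptions, not values taken from the repositories above:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

EMBED_DIM = 100   # must match the vector size of the GloVe file
MAX_LEN = 500     # illustrative cap on essay length, in tokens

# Fit a tokenizer on the (already lowercased, accent-stripped) training essays.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(essays_train["essay"])
x_train = pad_sequences(tokenizer.texts_to_sequences(essays_train["essay"]),
                        maxlen=MAX_LEN)

# Parse the GloVe text file: each line is "word v1 v2 ... v100".
glove = {}
with open("glove_s100.txt", encoding="utf-8") as fh:   # hypothetical file name
    for line in fh:
        parts = line.rstrip().split(" ")
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Build the embedding matrix; out-of-vocabulary rows stay all-zero.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, i in tokenizer.word_index.items():
    vec = glove.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

embedding_layer = Embedding(vocab_size, EMBED_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_LEN,
                            trainable=False)

One thing worth checking with a setup like this is vocabulary coverage: since load_data accent-strips the essays with unidecode, while Portuguese GloVe vectors normally keep accents, most lookups can miss and their embedding rows stay zero. The kappa itself can be computed with sklearn.metrics.cohen_kappa_score(y_true, y_pred, weights="quadratic") on rounded scores.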

0 Answers