How do I format a SpaCy training dataset for text classification?

Time: 2020-09-05 04:16:10

Tags: python list dictionary tuples spacy

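I am using the SpaCy library and need to convert my dataset into the expected output shown below. My code produces output of the right shape, but every value in it is zero. The required format is: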

[('word', {'cats': {'label_1': 0, 'label_2': 1, ... }})]

Expected output

[
('hug',
  {'cats': {'anger': 0,
    'anticipation': 0,
    'disgust': 0,
    'fear': 0,
    'joy': 1,
    'negative': 0,
    'positive': 0,
    'sadness': 0}}),

 ('cry',
  {'cats': {'anger': 0,
    'anticipation': 0,
    'disgust': 0,
    'fear': 0,
    'joy': 0,
    'negative': 0,
    'positive': 0,
    'sadness': 1}}),
...
]

This function iterates over the label list, setting the label at index n to 1 and the others to 0

def cat_dict_funct(cat_dict, lst, n):
    for i in range(8):
        if i == n:
            cat_dict[lst[i]] = 1
        else:
            cat_dict[lst[i]] = 0
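The function above loops over a fixed range(8), while the label list printed further down has ten entries, so the last two labels would never be written and an index of 8 or 9 would never produce a 1. A minimal sketch of a variant that covers however many labels are passed in follows; the name fill_cats and the placeholder labels are hypothetical, not part of the original code.

```python
def fill_cats(cat_dict, labels, n):
    """Set labels[n] to 1 and every other label to 0, in place."""
    # enumerate() covers the whole list, so no length is hard-coded.
    for i, label in enumerate(labels):
        cat_dict[label] = 1 if i == n else 0


# Placeholder labels, used only to show that index 9 is reachable.
labels = [f"label_{i}" for i in range(10)]
cats = {}
fill_cats(cats, labels, 9)
print(cats)
```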


Initialize the data and labels

train_data = df
train_labels = list(set(df.category))

['anger',
 'fear',
 'disgust',
 'positive',
 'sadness',
 'anticipation',
 'joy',
 'negative',
 'surprise',
 'trust']
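Because train_labels comes from a set, its order is not guaranteed, and for strings it can differ from one Python run to the next. Code that later assumes 'anger' is at index 0 or 'fear' at index 1 may therefore mark the wrong label. One possible way around this, sketched here with a small stand-in list instead of df.category, is to sort the labels and look up each index by name; label_to_index is a hypothetical helper name.

```python
# Stand-in for df.category; the real column would come from the DataFrame.
categories = ['joy', 'sadness', 'anger', 'joy', 'fear']

# Sorting makes the label order reproducible across runs.
train_labels = sorted(set(categories))

# Look indices up by name instead of hard-coding them.
label_to_index = {label: i for i, label in enumerate(train_labels)}

print(train_labels)
print(label_to_index['joy'])
```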

Iterate over each row's category and append the resulting dictionary to the list

train_texts = train_data['word'].tolist()
train_cats = train_data['category'].tolist()

final_train_cats, cat_dict = [], {}

for cat in train_cats:
    if cat == 'anger':
        cat_dict_funct(cat_dict, train_labels, 0)
    elif cat == 'fear':
        cat_dict_funct(cat_dict, train_labels, 1)
    elif cat == 'disgust':
        cat_dict_funct(cat_dict, train_labels, 2)
    elif cat == 'positive':
        cat_dict_funct(cat_dict, train_labels, 3)
    elif cat == 'sadness':
        cat_dict_funct(cat_dict, train_labels, 4)
    elif cat == 'anticipation':
        cat_dict_funct(cat_dict, train_labels, 5)
    elif cat == 'joy':
        cat_dict_funct(cat_dict, train_labels, 6)
    elif cat == 'negative':
        cat_dict_funct(cat_dict, train_labels, 7)
    elif cat == 'surprise':
        cat_dict_funct(cat_dict, train_labels, 8)
    elif cat == 'trust':
        cat_dict_funct(cat_dict, train_labels, 9)
    final_train_cats.append(cat_dict)
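One thing worth checking in the loop above: cat_dict is a single dictionary that is overwritten on each iteration, and append stores a reference to it rather than a copy. Every element of final_train_cats would then be the same object, showing only the values written for the last row. The sketch below reproduces that effect on a two-label toy example and contrasts it with appending a copy; all names in it are illustrative.

```python
labels = ['joy', 'sadness']
shared = {}

# Appending the same dict object each time: every entry ends up identical.
aliased = []
for active in labels:
    for label in labels:
        shared[label] = int(label == active)
    aliased.append(shared)

# Appending a snapshot with dict(): each entry keeps its own values.
copied = []
for active in labels:
    for label in labels:
        shared[label] = int(label == active)
    copied.append(dict(shared))

print(aliased)
print(copied)
```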

Zip the collected items into a list

TRAIN_DATA = list(zip(train_texts, [{"cats": cats} for cats in final_train_cats]))
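For comparison, here is a minimal sketch of how the same list of (text, {'cats': ...}) tuples might be built in one pass, creating a new dictionary for each row. It uses short stand-in lists in place of train_data['word'] and train_data['category'], so the values are assumptions for illustration only.

```python
# Hypothetical stand-ins for train_data['word'] and train_data['category'].
train_texts = ['hug', 'cry', 'scream']
train_cats = ['joy', 'sadness', 'fear']

# Sorted so the label order is the same on every run.
train_labels = sorted(set(train_cats))

# Build a fresh cats dict per row: 1 for the row's category, 0 otherwise.
TRAIN_DATA = [
    (text, {'cats': {label: int(label == cat) for label in train_labels}})
    for text, cat in zip(train_texts, train_cats)
]

# Sanity check: each example has exactly one label set to 1.
assert all(sum(ann['cats'].values()) == 1 for _, ann in TRAIN_DATA)

print(TRAIN_DATA[0])
```

A check like the final assert could also be run on the original TRAIN_DATA to confirm whether the entries really contain only zeros.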

0 answers:

No answers