Conversational JSON data as spaCy BERT model input

Date: 2019-11-28 07:18:13

Tags: model deep-learning nlp spacy

I am working with a dataset that looks like this:

TRAIN DATA (JSON)
= [{
    "id": int,                      # ID of the Conversation
    "dialogues": [{                 # list of dialogues in the corpus
        "id": int,                  # id of dialogue
        "raw": string,              # raw text of the paragraph
        "sentences": [{             # list of sentences in the paragraph
            "start": int,
            "end": int,
            "tokens": [{            # list of tokens and tags in the sentence
                "id": int,          # index of the token in the document
                "start": int,
                "end": int,
                "dep": string,      # dependency label
                "head": int,        # offset of token head relative to token index
                "tag": string,      # part-of-speech tag
                "orth": string,     # verbatim text of the token
                "ner": string,       # BILUO label, e.g. "O" or "B-ORG"
                "synonyms": [],
                "antonyms": [],
            }],
            "chunks": [{
                "label": string,     # Noun Chunks
                "value": []          # Value
            }],
            "coref": [{
                "label": string,     # Coref tags
                "value": []          # Value
            }],
            "acts": [{
                "label": string,     # Act
                "start": int,
                "end": int,
                "value": []          # Value
            }],
            "intents": [{
                "label": string,     # Intent
                "start": int,
                "end": int,
                "value": []          # Value
            }]
        }],
        "speaker": int,             # speaker id
        "start": timestamp,         # start time
        "end": timestamp,           # end time
        "features": [{              # features for Dialogue Classifier
            "label": string,        # text feature label: WPM, Hold, Difficulty Index
            "value": float / bool   # label value
        }]
    }],
    "cats": [{                      # cats for Call Classifier
        "label": string,            # text category label
        "value": float / bool       # label applies (1.0/true) or not (0.0/false)
    }]
}]
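Since the goal is call-level classification, the nested schema above eventually has to be flattened into the `(text, annotations)` pairs that spaCy's text categorizer trains on. A minimal sketch, assuming the field names from the schema (the helper name `make_textcat_examples` is my own, not part of spaCy):

```python
def make_textcat_examples(train_data):
    """Flatten the conversation JSON into (text, {"cats": ...}) pairs,
    the format spaCy's textcat component expects for nlp.update().
    Hypothetical helper; field names follow the schema above."""
    examples = []
    for call in train_data:
        # Join the raw text of every dialogue turn into one document.
        text = " ".join(d["raw"] for d in call["dialogues"])
        # spaCy wants cats as {label: float}, so coerce bool/float values.
        cats = {c["label"]: float(c["value"]) for c in call["cats"]}
        examples.append((text, {"cats": cats}))
    return examples
```

The per-token fields (`dep`, `tag`, `ner`, etc.) are simply not used at this stage; for a document-level classifier only the raw text and the call-level `cats` matter.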

The tutorial I want to follow to run the deep learning model is here: https://explosion.ai/blog/spacy-transformers

But the data shape it expects is: (7, 768)

and the data is processed like this:

texts, cats = zip(*batch)
nlp.update(texts, cats, sgd=optimizer, losses=losses)
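For context on that snippet: in the tutorial, `batch` is a list of `(text, annotations)` tuples, and `zip(*batch)` transposes it into one tuple of texts and one tuple of annotations. A plain-Python stand-in for the batching step (this is an illustrative sketch, not spaCy's own `spacy.util.minibatch`) makes the shapes concrete:

```python
def minibatches(examples, size=8):
    """Yield successive fixed-size batches of (text, annotations) pairs.
    A plain-Python stand-in for spacy.util.minibatch, for illustration."""
    for i in range(0, len(examples), size):
        yield examples[i:i + size]

examples = [("call one", {"cats": {"EMPATHY": 1.0}}),
            ("call two", {"cats": {"EMPATHY": 0.0}}),
            ("call three", {"cats": {"EMPATHY": 1.0}})]

for batch in minibatches(examples, size=2):
    # zip(*batch) transposes the batch: all texts, then all annotations.
    texts, cats = zip(*batch)
```

Inside the real loop, `texts` and `cats` unpacked this way are exactly what `nlp.update(texts, cats, sgd=optimizer, losses=losses)` consumes.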

My purpose in using this model and this data format is to classify calls for quality assurance, e.g. the level of empathy shown by both parties, active listening to concerns, and so on, which is why my data is shaped as shown above.

  

Please let me know if you think I should approach this problem from a different angle.

I think I am good at NLP preprocessing and feature engineering, but I am new to deep learning, so I have not been able to configure the spaCy BERT model on my own. Any help would be greatly appreciated.

  

Thanks!

0 Answers:

There are no answers yet.