I am working with a dataset that looks like this:
TRAIN DATA (JSON) = [{
    "id": int,                     # ID of the Conversation
    "dialogues": [{                # list of dialogues in the corpus
        "id": int,                 # id of dialogue
        "raw": string,             # raw text of the paragraph
        "sentences": [{            # list of sentences in the paragraph
            "start": int,
            "end": int,
            "tokens": [{           # list of tokens and tags in the sentence
                "id": int,         # index of the token in the document
                "start": int,
                "end": int,
                "dep": string,     # dependency label
                "head": int,       # offset of token head relative to token index
                "tag": string,     # part-of-speech tag
                "orth": string,    # verbatim text of the token
                "ner": string,     # BILUO label, e.g. "O" or "B-ORG"
                "synonyms": [],
                "antonyms": []
            }],
            "chunks": [{
                "label": string,   # Noun Chunks
                "value": []        # Value
            }],
            "coref": [{
                "label": string,   # Coref tags
                "value": []        # Value
            }],
            "acts": [{
                "label": string,   # Act
                "start": int,
                "end": int,
                "value": []        # Value
            }],
            "intents": [{
                "label": string,   # Intent
                "start": int,
                "end": int,
                "value": []        # Value
            }]
        }],
        "speaker": int,            # speaker id
        "start": timestamp,        # start time
        "end": timestamp,          # end time
        "features": [{             # features for Dialogue Classifier
            "label": string,       # text feature label: WPM, Hold, Difficulty Index
            "value": float / bool  # label value
        }]
    }],
    "cats": [{                     # cats for Call Classifier
        "label": string,           # text category label
        "value": float / bool      # label applies (1.0/true) or not (0.0/false)
    }]
}]
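For reference, here is a minimal sketch (not from the tutorial) of how I imagine this structure could be flattened into the (text, cats) pairs a spaCy text categorizer expects; the file name "train.json" and the loader name are assumptions for illustration only:

import json

def load_call_examples(path="train.json"):  # hypothetical file name
    """Flatten the corpus above into (call text, {label: value}) pairs."""
    with open(path, encoding="utf8") as f:
        corpus = json.load(f)
    examples = []
    for conversation in corpus:
        # Concatenate the raw text of every dialogue turn into one document.
        text = " ".join(dialogue["raw"] for dialogue in conversation["dialogues"])
        # Turn the call-level "cats" list into the {label: float} dict
        # that spaCy's text categorizer expects (bools become 0.0/1.0).
        cats = {cat["label"]: float(cat["value"]) for cat in conversation["cats"]}
        examples.append((text, cats))
    return examples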
Here is the tutorial I want to follow to run the deep learning model: https://explosion.ai/blog/spacy-transformers
However, the data shape expected there is (7, 768),
and it is processed like this:
texts, cats = zip(*batch)  # unpack a minibatch of (text, annotation) pairs
nlp.update(texts, cats, sgd=optimizer, losses=losses)  # one training step on the batch
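To show how I think the pieces fit together, here is a rough sketch of that training loop wired up for call-level categories, assuming spacy-transformers v0.x with the en_trf_bertbaseuncased_lg model from the blog post; the label names and the load_call_examples() helper from the sketch above are placeholders I made up, not part of the tutorial:

import random

import spacy
import torch
from spacy.util import minibatch

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
# Multi-label setup: a call can be positive for several QA categories at once.
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": False})
for label in ("EMPATHY", "ACTIVE_LISTENING"):  # placeholder labels
    textcat.add_label(label)
nlp.add_pipe(textcat)

# Each example is (call text, {"cats": {label: 0.0/1.0}}), e.g. built with
# the hypothetical load_call_examples() loader sketched above.
train_data = [(text, {"cats": cats}) for text, cats in load_call_examples()]

optimizer = nlp.resume_training()
for epoch in range(4):
    random.shuffle(train_data)
    losses = {}
    for batch in minibatch(train_data, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(epoch, losses)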
My goal with this model and this data format is to classify calls for quality assurance, e.g. the level of empathy shown by each party, active listening to concerns, and so on, which is why my data is shaped as shown above.
Please let me know if you think I should approach this problem from a different angle.
I consider myself comfortable with NLP preprocessing and feature engineering, but I am new to deep learning, so I have not been able to configure the spaCy BERT model on my own. Any help would be greatly appreciated.
Thanks!