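I am trying to understand how to prepare paragraphs for ELMo vectorization.
The docs only show how to embed multiple sentences/words at once.
For example: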
sentences = [["the", "cat", "is", "on", "the", "mat"],
             ["dogs", "are", "in", "the", "fog", ""]]
elmo(
    inputs={
        "tokens": sentences,
        "sequence_len": [6, 5]
    },
    signature="tokens",
    as_dict=True
)["elmo"]
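As I understand it, this returns 2 vectors, one representing each given sentence. How would I prepare the input data to vectorize a whole paragraph containing multiple sentences? Note that I want to use my own preprocessing.
Can it be done like this?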
sentences = [["<s>", "the", "cat", "is", "on", "the", "mat", ".", "</s>",
              "<s>", "dogs", "are", "in", "the", "fog", ".", "</s>"]]
Or maybe like this?
sentences = [["the", "cat", "is", "on", "the", "mat", ".",
              "dogs", "are", "in", "the", "fog", "."]]
Answer 0 (score: 0)
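ELMo produces contextual word vectors. So the vector for a word is a function of both the word itself and the context (e.g. the sentence) it appears in.
Like the example from the docs, you want your paragraph to be a list of sentences, each of which is a list of tokens. So, your second example. To get this format, you could use the spacy tokenizer: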
import spacy
# you need to install the language model first. See spacy docs.
nlp = spacy.load('en_core_web_sm')
text = "The cat is on the mat. Dogs are in the fog."
toks = nlp(text)
sentences = [[w.text for w in s] for s in toks.sents]
I don't think you need the "" padding on the second sentence, since sequence_len takes care of that.
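If the batch does have to be rectangular (the docs example pads the shorter sentence with ""), the padding and sequence_len can be derived from the tokenized sentences instead of being written by hand. A minimal sketch in plain Python; pad_sentences is a hypothetical helper name, not part of spaCy or the ELMo module:

```python
def pad_sentences(sentences, pad_token=""):
    """Pad tokenized sentences to a common length.

    Returns (padded, lengths), where lengths holds the true token count
    of each sentence before padding.
    """
    lengths = [len(s) for s in sentences]
    max_len = max(lengths, default=0)
    # Append pad_token to every sentence shorter than the longest one.
    padded = [s + [pad_token] * (max_len - len(s)) for s in sentences]
    return padded, lengths


sentences = [["The", "cat", "is", "on", "the", "mat", "."],
             ["Dogs", "are", "in", "the", "fog", "."]]
padded, sequence_len = pad_sentences(sentences)
```

The two return values would then go in as "tokens" and "sequence_len" respectively.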
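Update:
"As I understand it, this returns 2 vectors, one representing each given sentence"
No, this returns one vector for each word in each sentence. If you want the whole paragraph to be the context (for each word), just change it to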
sentences = [["the", "cat", "is", "on", "the", "mat", "dogs", "are", "in", "the", "fog"]]
and
...
"sequence_len": [11]