How can I convert text to word embeddings "faster" with a pretrained BERT model?

Time: 2020-12-29 15:48:40

Tags: python-3.x nlp word-embedding bert-language-model

I am trying to get word embeddings for clinical data using microsoft/pubmedbert. I have 3.6 million rows of text. Converting 10k rows of text to vectors takes about 30 minutes, so the full 3.6 million rows would take roughly 180 hours (about 8 days).

Is there any way to speed up the process?

My code -

import re

import numpy as np
import pandas as pd
from transformers import AutoTokenizer
from transformers import pipeline

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('feature-extraction', model=model_name, tokenizer=tokenizer)

def lambda_func(row):
    tokens = tokenizer(row['notetext'])
    if len(tokens['input_ids']) > 512:
        # Crude truncation: keep the first 512 word-boundary pieces of the
        # raw text (this does not guarantee <= 512 subword tokens).
        tokens = re.split(r'\b', row['notetext'])
        tokens = [t for t in tokens if len(t) > 0]
        row['notetext'] = ''.join(tokens[:512])
    # [0][0] is the embedding of the first ([CLS]) token of the sequence.
    row['vectors'] = classifier(row['notetext'])[0][0]
    return row

def process(progress_notes):
    # One pipeline call per row.
    progress_notes = progress_notes.apply(lambda_func, axis=1)
    return progress_notes

progress_notes = process(progress_notes)
vectors_breadth = 768
vectors_length = len(progress_notes)
vectors_2d = np.reshape(progress_notes['vectors'].to_list(), (vectors_length, vectors_breadth))
vectors_df = pd.DataFrame(vectors_2d)
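
For reference, a common way to speed this up is to avoid one pipeline call per row and instead tokenize and run the model directly on padded, truncated batches (and on a GPU if one is available). Below is a minimal sketch of that idea, not a drop-in replacement for the code above; the batch size of 32 is an illustrative assumption, and it takes the first ([CLS]) token embedding just as classifier(...)[0][0] does above.

# Sketch: batched feature extraction with the raw model instead of one
# pipeline call per row. Batch size and device handling are assumptions.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

def embed_batch(texts, batch_size=32):
    """Return the first-token ([CLS]) embedding for each text."""
    vectors = []
    with torch.no_grad():
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            # Let the tokenizer truncate to the 512-token limit instead of
            # splitting the raw text by hand.
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=512, return_tensors="pt").to(device)
            out = model(**enc)
            # last_hidden_state has shape (batch, seq_len, 768).
            vectors.append(out.last_hidden_state[:, 0, :].cpu().numpy())
    return np.vstack(vectors)

# e.g. vectors_2d = embed_batch(progress_notes['notetext'].tolist())

On a GPU this kind of batching is usually far faster than per-row pipeline calls; even on CPU it removes much of the per-call tokenization and Python overhead.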

My progress_notes dataframe looks like -

progress_notes = pd.DataFrame({
    'id': [1, 2, 3],
    'progressnotetype': ['Nursing Note', 'Nursing Note', 'Administration Note'],
    'notetext': [
        'Patient\'s skin is grossly intact with exception of skin tear to r inner elbow and r lateral lower leg',
        'Patient with history of Afib with RVR. Patient is incontinent of bowel and bladder.',
        'Give 2 tablet by mouth every 4 hours as needed for Mild to moderate Pain Not to exceed 3 grams in 24 hours'
    ]
})

Note - 1) I am running on an AWS EC2 r5.8xlarge instance (32 CPUs). I tried multiprocessing, but the code deadlocked because BERT was using all of my CPU cores.
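
If CPU multiprocessing is still the route taken, a common workaround for that kind of deadlock is thread oversubscription: each worker process also spawns PyTorch intra-op threads, so limiting every worker to a single thread often helps. A minimal sketch, assuming 8 workers and a chunk size of 64 (both illustrative), with each worker loading its own copy of the model:

# Sketch: CPU multiprocessing without thread oversubscription. Each worker
# pins PyTorch to one thread and loads the model once in its initializer.
import numpy as np
import torch
from multiprocessing import Pool
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
_worker_model = None
_worker_tokenizer = None

def init_worker():
    global _worker_model, _worker_tokenizer
    torch.set_num_threads(1)  # one intra-op thread per process
    _worker_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    _worker_model = AutoModel.from_pretrained(MODEL_NAME)
    _worker_model.eval()

def embed_chunk(texts):
    # Returns the first-token ([CLS]) vector for each text in the chunk.
    with torch.no_grad():
        enc = _worker_tokenizer(texts, padding=True, truncation=True,
                                max_length=512, return_tensors="pt")
        out = _worker_model(**enc)
        return out.last_hidden_state[:, 0, :].numpy()

def parallel_embed(texts, n_workers=8, chunk_size=64):
    chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]
    with Pool(n_workers, initializer=init_worker) as pool:
        return np.vstack(pool.map(embed_chunk, chunks))

The worker count, chunk size, and per-worker memory (one model copy per process) would need tuning for the instance.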

0 Answers:

No answers yet