How to train a sense2vec model

Date: 2016-06-21 13:36:29

Tags: python word2vec spacy

The sense2vec documentation mentions three main files, the first being merge_text.py. I tried several input types (txt, csv, bzipped files), since merge_text.py tries to open files compressed with bzip2.

The file can be found here: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What input format does this script expect? Also, could someone please suggest how to train the model?

2 Answers:

Answer 0 (score: 4)

I extended and adjusted the code samples from sense2vec.

You can go from this input text:

"就沙特阿拉伯及其动机而言,这也很简单。沙特人 善于金钱和算术。面对亏本的痛苦选择 保持目前的产量为每桶60美元或减少200万桶 每天离开市场并损失更多的钱 - 这是一个简单的选择:采取 这条路不那么痛苦。如果有像伤害美国这样的次要原因 紧张的石油生产国或伤害伊朗和俄罗斯,这很好,但它确实如此 只是钱。"

To this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double line breaks are interpreted as separate documents.
  • URLs are recognized as such, stripped down to domain.tld and tagged as |URL.
  • Nouns (including nouns that are part of a noun phrase) are lemmatized (which is why motives becomes motif).
  • Words with POS tags such as DET (determiner) and PUNCT (punctuation) are dropped.

Here is the code. Let me know if you have questions.

I will probably publish it on github.com/woltob soon.

import spacy
import re

nlp = spacy.load('en')  # spaCy 1.x model name; the span.merge() calls below also use the 1.x API
nlp.matcher = None

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')
post_format_re = re.compile(r'[\`\*\~]$')
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')
single_linebreak_re = re.compile('\n')
double_linebreak_re = re.compile('\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3)+'|URL'
        else:
            return word.text.lower().strip()+'|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Drop punctuation (PUNCT) and determiners (DET) such as 'the'
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #    tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)

doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the original first character (preserving case), use the lemma for
        # the rest, then re-append the trailing whitespace if there was any:
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    else:
        # All other words are added unchanged.
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file_:
    file_.write(result)
print(result)
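
The question also asked how to actually train a model. On a corpus in this token|TAG format you can train an ordinary word2vec model with Gensim instead of sense2vec's own training script — a minimal sketch, assuming a pre-4.0 Gensim (the dimensionality parameter is called size there; Gensim 4.x renamed it to vector_size):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sentence of token|TAG entries per line, as written to text.txt above.
sentences = LineSentence('text.txt')

# min_count=1 only because the toy corpus above is tiny; raise it for real data.
model = Word2Vec(sentences, size=128, window=5, min_count=1, sg=1, workers=4)
model.save('sense2vec_gensim.model')

# Nearest neighbours of a tagged token:
print(model.wv.most_similar('money|NOUN'))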

You can visualize your model with Gensim in TensorBoard using this approach: https://github.com/ArdalanM/gensim2tensorboard
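
For reference, here is a rough sketch of what that export boils down to if you do it by hand — writing a vectors.tsv and a metadata.tsv that the TensorBoard Embedding Projector can load (model is the Gensim model trained above; index2word is the pre-4.0 vocabulary list, replaced by key_to_index in Gensim 4.x):

# Manual export for the TensorBoard Embedding Projector.
with open('vectors.tsv', 'w') as vec_f, open('metadata.tsv', 'w') as meta_f:
    for token in model.wv.index2word:
        vec_f.write('\t'.join(str(v) for v in model.wv[token]) + '\n')
        meta_f.write(token + '\n')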

Also adjust that code to work with the sense2vec approach (for example, words are lowercased in its preprocessing step; simply comment that out in the code).

Happy coding, woltob

Answer 1 (score: 0)

The input file should be a bzip2-compressed archive. To use a plain text file instead, just edit merge_text.py as follows:

def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
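
Note that with this edit the script still opens the file through bz2.BZ2File, so a plain-text corpus has to be compressed first. A minimal sketch using Python's bz2 module, with hypothetical file names:

import bz2

# corpus.txt / corpus.txt.bz2 are hypothetical names; after the edit above,
# iter_comments() yields one plain-text document per line from this archive.
with open('corpus.txt', 'rb') as src, bz2.BZ2File('corpus.txt.bz2', 'wb') as dst:
    for line in src:
        dst.write(line)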