Tokenization of hyphenated words, decimal numbers and apostrophes

Date: 2019-09-25 20:42:19

Tags: python spacy

I am trying to post-process hyphenated words that are tokenized as separate tokens when they should presumably be a single token.

Example:

Sentence: "up-scaled"
Tokens: ['up', '-', 'scaled']
Expected: ['up-scaled']
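For reference, this is what the default tokenizer does out of the box (a minimal sketch, assuming an English model such as en_core_web_sm is installed):

    import spacy

    nlp = spacy.load('en_core_web_sm')
    print([token.text for token in nlp('up-scaled')])
    # ['up', '-', 'scaled'] -- the default infix rules split on a hyphen between letters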

Currently, my solution is to use a Matcher:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)
pattern = [{'IS_ALPHA': True, 'IS_SPACE': False},
           {'ORTH': '-'},
           {'IS_ALPHA': True, 'IS_SPACE': False}]

matcher.add('HYPHENATED', None, pattern)  # spaCy 2.x add() signature

def quote_merger(doc):
    # this will be called on the Doc object in the pipeline
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    # merge into one token after collecting all matches;
    # retokenize() is the non-deprecated replacement for span.merge()
    with doc.retokenize() as retokenizer:
        for span in matched_spans:
            retokenizer.merge(span)
    return doc

nlp.add_pipe(quote_merger, first=True)  # add it right after the tokenizer (spaCy 2.x style)

text = "The image was up-scaled"  # example input
doc = nlp(text)
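Running the pipeline on that example input then yields the merged token (output as observed with spaCy 2.x):

    print([token.text for token in doc])
    # ['The', 'image', 'was', 'up-scaled']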

However, as expected, this also causes an issue in cases like the following:

Example 2:

Sentence: "I know I will be back - I had a very pleasant time"
Tokens: ['i', 'know', 'I', 'will', 'be', 'back - I', 'had', 'a', 'very', 'pleasant', 'time']
Expected: ['i', 'know', 'I', 'will', 'be', 'back', '-', 'I', 'had', 'a', 'very', 'pleasant', 'time']

Is there a way to handle only words separated by a hyphen with no whitespace between the characters? That way a word like "up-scaled" would be matched and merged into a single token, but not ".. back - I ..". One approach is sketched below.
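One way to do this is to keep the Matcher but check the tokens' trailing whitespace before merging: in spaCy, token.whitespace_ is empty exactly when the next token follows with no space in between. A minimal sketch of such a filter (the helper name is_tight_span is mine, not spaCy's):

    def is_tight_span(span):
        # keep the span only if no token inside it is followed by whitespace,
        # i.e. the hyphen touches both neighbouring words ("up-scaled", not "back - I")
        return all(not token.whitespace_ for token in span[:-1])

    matched_spans = [span for span in matched_spans if is_tight_span(span)]

With this filter added to quote_merger, "up-scaled" is still merged, while "back - I" is left alone because "back" carries a trailing space.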

I have already tried the solution posted in: Why does spaCy not preserve intra-word-hyphens during tokenization like Stanford CoreNLP does?

However, I did not use that solution, because it caused incorrect tokenization of words with apostrophes (') and of numbers with decimal points:

Sentence: "It's"
Tokens: ["I", "t's"]
Expected: ["It", "'s"]

Sentence: "1.50"
Tokens: ["1", ".", "50"]
Expected: ["1.50"]

This is why I used a Matcher instead of trying to edit the regular expressions.
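For completeness, the tokenizer-level route can be made safe for apostrophes and decimals by rebuilding only the infix rules and dropping just the letter-hyphen-letter pattern; the tokenizer exceptions that split "'s" and the number handling are untouched. This is a sketch adapted to spaCy 2.x's default English punctuation rules, not code from the linked answer, so the exact rule list may need checking against your spaCy version:

    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
    from spacy.util import compile_infix_regex

    infixes = (
        LIST_ELLIPSES
        + LIST_ICONS
        + [
            r'(?<=[0-9])[+\-\*^](?=[0-9-])',
            r'(?<=[{al}{q}])\.(?=[{au}{q}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES),
            r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
            # the default rule that splits on a hyphen between letters is omitted here
            r'(?<=[{a}0-9])[:<>=/](?=[{a}])'.format(a=ALPHA),
        ]
    )

    infix_re = compile_infix_regex(infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer

    print([token.text for token in nlp("It's up-scaled by 1.50")])
    # expected: ['It', "'s", 'up-scaled', 'by', '1.50']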

0 Answers:

There are no answers yet.