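How can I logically segment a sentence using spaCy?

Time: 2017-11-30 16:59:16

Tags: nlp spacy

I am new to spaCy and am trying to segment a sentence logically so that I can process each part separately, e.g.: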

"If the country selected is 'US', then the zip code should be numeric"

This needs to be broken into:

If the country selected is 'US',
then the zip code should be numeric

Another sentence that contains commas should not be broken:

The allowed states are NY, NJ and CT

Any ideas or thoughts on how to do this in spaCy?

1 Answer:

Answer 0: (score: 0)

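I am not sure whether this can be done without first training a model on custom data. However, spaCy does allow you to add rules for tokenization, sentence segmentation, and so on.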

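The following code may be useful in this case; you can change the rules as needed. It is written against the spaCy 2.x API (for example, it calls span.merge() and passes plain functions to nlp.add_pipe).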

# Import spaCy and the Matcher, which is used to find the spans to merge
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')

# Define the pattern: an alphabetic token wrapped in single quotes (e.g. 'US') should be merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]


# Add the pattern to the matcher
matcher.add('special_merger', None, pattern)


# Pipeline component that merges every matched span into a single token
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# Pipeline component that marks the tokens that should start a new sentence
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Rule: if the previous token is "," and the token before that is "'US'",
# then the current token should start a new sentence.
def should_be_sentence_start(token):
    return token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'US'"

# Add the merger and the sentence-start rule to the nlp pipeline (both must run before the parser)
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Apply the pipeline to the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)

Output:

If the country selected is 'US',
then the zip code should be numeric
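
The rule above is tied to the literal token 'US', so it will not fire for other conditional sentences. One possible generalization is to break after a comma only when the next word is a clause marker such as "then". The sketch below illustrates that idea in plain Python with the re module rather than spaCy; split_clauses and CLAUSE_MARKERS are hypothetical names made up for this example, and the marker list is an assumption you would need to extend for your own data.

```python
import re

# Hypothetical marker list (an assumption, not something spaCy provides):
# words that start a new logical clause when they follow a comma.
CLAUSE_MARKERS = ("then",)


def split_clauses(text, markers=CLAUSE_MARKERS):
    """Split text after a comma only when the next word is a clause marker."""
    alternatives = "|".join(re.escape(marker) for marker in markers)
    # Match the whitespace that sits between a comma and a marker word.
    # The lookbehind/lookahead keep the comma and the marker in the output.
    boundary = re.compile(r"(?<=,)\s+(?=(?:%s)\b)" % alternatives, re.IGNORECASE)
    return [part.strip() for part in boundary.split(text) if part.strip()]


conditional = "If the country selected is 'US', then the zip code should be numeric"
listing = "The allowed states are NY, NJ and CT"

for part in split_clauses(conditional):
    print(part)
# The list sentence has commas but no marker after them, so it stays whole.
for part in split_clauses(listing):
    print(part)
```

The same condition could probably be moved into should_be_sentence_start by testing the current token's lowercase text against the marker list and checking that the previous token is a comma, which would keep everything inside the spaCy pipeline.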