Question

I am new to spaCy and am trying to segment sentences logically so that I can process each part separately, e.g.:

"If the country selected is 'US', then the zip code should be numeric"

This needs to be broken into:

If the country selected is 'US',
then the zip code should be numeric

Another sentence that contains commas should not be broken:

The allowed states are NY, NJ and CT

Any thoughts or ideas on how to do this in spaCy?

Answer 1

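I'm not sure this can be done without training a model on custom data. However, spaCy lets you add rules for tokenization, sentence segmentation, and so on.

The following code (written against the spaCy 2.x API) may be useful in this case; you can change the rules as needed.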

#Importing spacy and Matcher to merge matched patterns
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en')

#Defining the pattern: any alphabetic text wrapped in single quotes is merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]


#Adding pattern to the matcher
matcher.add('special_merger', None, pattern)


#Method to merge matched patterns
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

#To determine whether a token can be start of the sentence.
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

#Defining the rule: if the previous token is "," and the token before that is "'US'",
#then the current token should be the start of a sentence.
def should_be_sentence_start(token):
    return token.i >= 2 and token.nbor(-1).text == "," and token.nbor(-2).text == "'US'"

#Adding matcher and sentence tokenizing to nlp pipeline.
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

#Applying NLP on the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)

Output:

If the country selected is 'US',
then the zip code should be numeric
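
The rule in should_be_sentence_start is tied to the literal 'US'. As a way to try out a more general version of the same idea without spaCy, here is a minimal sketch that uses only the standard library's `re` module. It assumes a boundary belongs after any comma that directly follows a single-quoted alphabetic value; the helper name `split_after_quoted_value` is hypothetical and not part of spaCy.

```python
import re

# A single-quoted alphabetic value immediately followed by a comma, e.g. 'US',
# (assumption: this is the only place where a segment boundary is wanted)
QUOTED_THEN_COMMA = re.compile(r"'[A-Za-z]+',")


def split_after_quoted_value(text):
    """Split text after every comma that follows a single-quoted value."""
    segments = []
    start = 0
    for match in QUOTED_THEN_COMMA.finditer(text):
        end = match.end()
        segments.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


for sentence in (
    "If the country selected is 'US', then the zip code should be numeric",
    "The allowed states are NY, NJ and CT",
):
    print(split_after_quoted_value(sentence))
```

The first sentence comes back as two segments and the second as one, because the commas in "NY, NJ and CT" do not follow a quoted value. Once the rule behaves as wanted, the same condition can be moved into should_be_sentence_start by checking the merged token's text instead of comparing it to "'US'".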
