I am new to spaCy and am trying to segment a sentence logically so that I can process each part separately, e.g.:
"If the country selected is 'US', then the zip code should be numeric"
This needs to be broken down into:
If the country selected is 'US',
then the zip code should be numeric
Another sentence that contains commas should not be broken:
The allowed states are NY, NJ and CT
Any thoughts or ideas on how to do this in spaCy?
Answer 0 (score: 0)
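I am not sure whether this can be done without first training a model on custom data. However, spaCy does let you add rules for tokenization, sentence segmentation, and so on.
The following code may be useful in this case; you can change the rules to suit your needs.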
# Import spaCy and the Matcher used to merge matched patterns.
# Note: this snippet uses the spaCy 2.x API.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')

# Pattern: alphabetic text wrapped in single quotes is merged into a single token
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True},
           {'ORTH': "'"}]

# Add the pattern to the matcher
matcher.add('special_merger', None, pattern)

# Pipeline component that merges every matched span into one token
def special_merger(doc):
    matched_spans = []
    matches = matcher(doc)
    # Collect the spans first, then merge, so merging does not shift the match offsets
    for match_id, start, end in matches:
        span = doc[start:end]
        matched_spans.append(span)
    for span in matched_spans:
        span.merge()
    return doc

# Pipeline component that marks tokens which should start a new sentence
def should_sentence_start(doc):
    for token in doc:
        if should_be_sentence_start(token):
            token.is_sent_start = True
    return doc

# Rule: if the previous token is "," and the token before that is "'US'",
# then the current token starts a new sentence.
def should_be_sentence_start(token):
    return (token.i >= 2
            and token.nbor(-1).text == ","
            and token.nbor(-2).text == "'US'")

# Add the merger and the sentence-start rule to the nlp pipeline.
# The rule must run before the parser, which keeps the boundaries set here.
nlp.add_pipe(special_merger, first=True)
nlp.add_pipe(should_sentence_start, before='parser')

# Apply the pipeline to the required text
sent_texts = "If the country selected is 'US', then the zip code should be numeric"
doc = nlp(sent_texts)
for sent in doc.sents:
    print(sent)
Output:
If the country selected is 'US',
then the zip code should be numeric
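To check the rule against both sentences from the question without loading a spaCy model, the same idea can be sketched in plain Python. This is only a rough illustration, not part of the pipeline above: `CLAUSE_BREAK` and `split_clauses` are hypothetical names, and the rule is slightly generalized so that it breaks after any comma that directly follows a closing single quote, rather than only after `'US'`.

```python
import re

# Hypothetical rule, loosely mirroring the answer: break only at whitespace
# that comes right after a closing single quote followed by a comma ("',").
CLAUSE_BREAK = re.compile(r"(?<=',)\s+")


def split_clauses(text):
    """Split text after "'," and leave every other comma alone."""
    return [part for part in CLAUSE_BREAK.split(text) if part]


examples = [
    "If the country selected is 'US', then the zip code should be numeric",
    "The allowed states are NY, NJ and CT",
]

for sentence in examples:
    for clause in split_clauses(sentence):
        print(clause)
    print("---")
```

Under this rule the first sentence is split into two clauses after `'US',` while the list of states stays in one piece, because none of its commas follow a quoted value. A regex like this knows nothing about tokens or grammar, so for anything beyond a quick check the pipeline-based rule is likely the better place to encode the logic.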