I have a function based on nltk.pos_tag that filters the collocations in a text, keeping only adjective (JJ) + noun (NN) pairs.
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

f1 = 'this is my random text'
tokens = word_tokenize(f1)
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)
print(bigramFreqTable)
def rightTypes(ngram):
    # One-element tuples, so that `in` tests tag membership rather than substring match
    first_type = ('JJ',)
    second_type = ('NN',)
    tags = nltk.pos_tag(ngram)
    return tags[0][1] in first_type and tags[1][1] in second_type

filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(rightTypes)]
print(filtered_bi)
I would like to use spaCy instead of nltk.pos_tag. Below is the example code from the spaCy documentation.
import spacy
from spacy.lang.en.examples import sentences
nlp = spacy.load('en_core_web_sm')
doc = nlp(sentences[0])
print(doc.text)
for token in doc:
    print(token.text, token.pos_)
I tried different solutions, for example tags = [(X.text, X.tag_) for Y in nlp(ngram).ents for X in Y], but I get errors... Could you help me use spaCy instead of nltk?
Answer 0 (score: 1)
With spaCy's Matcher, you can create custom rules to match against.
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with one pattern and no callback
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# spaCy v3 signature; in spaCy v2 this was matcher.add("HelloWorld", None, pattern)
matcher.add("HelloWorld", [pattern])
doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
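You can use a pattern like the following. JJ and NN are fine-grained tags, which spaCy matches through the TAG attribute; POS expects coarse tags such as ADJ and NOUN:
[{"TAG": "JJ"}, {"TAG": "NN"}]
to meet your requirement.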
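If you would rather keep the shape of the original nltk code, here is a minimal sketch of an alternative that does not use Matcher. It tags the whole text once with spaCy and then walks adjacent tokens, which also avoids tagging each bigram in isolation. The helper names count_tag_bigrams and spacy_tag are made up for illustration, and the sketch assumes the en_core_web_sm model is installed.

```python
from collections import Counter


def count_tag_bigrams(tagged, first_tag="JJ", second_tag="NN"):
    """Count adjacent word pairs whose tags are (first_tag, second_tag).

    `tagged` is a list of (word, tag) tuples, the same shape that
    nltk.pos_tag returns.
    """
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == first_tag and t2 == second_tag:
            counts[(w1, w2)] += 1
    return counts


def spacy_tag(text, model="en_core_web_sm"):
    """Tag `text` with spaCy (assumes spaCy and the named model are installed)."""
    import spacy  # lazy import: only needed when this helper is called

    nlp = spacy.load(model)
    # token.tag_ holds the fine-grained tag (JJ, NN, ...);
    # token.pos_ holds the coarse tag (ADJ, NOUN, ...).
    return [(token.text, token.tag_) for token in nlp(text)]
```

Calling count_tag_bigrams(spacy_tag(f1)).most_common() should then give the filtered bigrams with their frequencies. The exact tags depend on the model, so no output is shown here.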