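I have some text, for example: 80% of $300,000 Each Human Resource/IT Department.
I need to extract $300,000
as well as Each Human Resource/IT Department.
After tokenizing, I used POS tagging to tag the words. I can extract 300,000, but not the $ sign along with it.
What I have so far: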
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
for i in tokenized:
    # Tokenize and POS-tag each sentence, then chunk it with the grammar
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
Converted to a list, chunked gives: ['80 %', '300,000', 'Each Human Resource/IT Department']
What I want is: ['80 %', '$300,000', 'Each Human Resource/IT Department']
I tried:
chunkGram = r"""chunk: {</$CD>|<DT>+<NN.*>+<NN.*>?|<NNP>?|<CD>+<NN>?|}"""
It still doesn't work. So what I need is the $ together with the CD.
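Answer 0 (score: 1)
You need to add <\$>? to your grammar. The tokenizer emits $ as a separate token whose tag is also $, so it needs its own tag pattern in front of <CD>, and the $ must be escaped because it is a regex metacharacter.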
chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
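To see why the extra tag pattern matters, here is a minimal sketch that runs the number rule with and without `<\$>?` on a few hand-tagged tokens. The tags are typed in by hand (copied from the tagger output shown in this answer) so that no tagger model is needed, and `chunk_words` is a hypothetical helper introduced only for this illustration.

```python
import nltk

# Hand-tagged tokens standing in for nltk.pos_tag output (an assumption
# made so the example runs without downloading a tagger model).
tagged = [("of", "IN"), ("$", "$"), ("300,000", "CD")]


def chunk_words(grammar, tokens):
    """Hypothetical helper: return the words of every 'chunk' subtree."""
    tree = nltk.RegexpParser(grammar).parse(tokens)
    return [
        [word for word, _tag in subtree.leaves()]
        for subtree in tree.subtrees(filter=lambda t: t.label() == "chunk")
    ]


# Each <...> element matches the tag of exactly one token, so <CD>+ alone
# can never include the separate "$" token.
without_dollar = chunk_words(r"chunk: {<CD>+<NN>?}", tagged)

# An optional, escaped <\$> in front of <CD>+ lets the chunk start at "$".
with_dollar = chunk_words(r"chunk: {<\$>?<CD>+<NN>?}", tagged)

print(without_dollar)
print(with_dollar)
```

Since every angle-bracket element consumes one token, a pattern such as `</$CD>` looks for a single token whose tag is literally `/$CD`, which is presumably why the attempt in the question matched nothing.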
Code:
import nltk
from nltk.tokenize import PunktSentenceTokenizer

text = '80% of $300,000 Each Human Resource/IT Department'
train_text = text
sample_text = text
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)
for i in tokenized:
    words = nltk.word_tokenize(i)
    tagged = nltk.pos_tag(words)
    # <\$>? optionally pulls the separately tagged "$" token into the number chunk
    chunkGram = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
    chunkParser = nltk.RegexpParser(chunkGram)
    chunked = chunkParser.parse(tagged)
    print(chunked)
Output:
(S
  (chunk 80/CD %/NN)
  of/IN
  (chunk $/$ 300,000/CD)
  (chunk Each/DT Human/NNP Resource/IT/NNP Department/NNP))
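The question asks for a flat list rather than a tree. One possible way to get there is sketched below, assuming the tokens and tags are exactly those in the output above (they are typed in by hand so the example does not need a tagger model). `chunk_to_text` is a hypothetical helper, and gluing "$" back onto the number is a simple string replacement chosen for this illustration, not something NLTK does for you.

```python
import nltk

# Tokens and tags copied by hand from the tree printed above.
tagged = [
    ("80", "CD"), ("%", "NN"), ("of", "IN"), ("$", "$"), ("300,000", "CD"),
    ("Each", "DT"), ("Human", "NNP"), ("Resource/IT", "NNP"), ("Department", "NNP"),
]

grammar = r"""chunk: {<DT>+<NN.*>+<NN.*>?|<\$>?<CD>+<NN>?|<NNP>?}"""
chunked = nltk.RegexpParser(grammar).parse(tagged)


def chunk_to_text(subtree):
    """Hypothetical helper: join a chunk's words and reattach "$" to the amount."""
    words = [word for word, _tag in subtree.leaves()]
    return " ".join(words).replace("$ ", "$")


# Keep only the subtrees labelled "chunk"; the root "S" node is skipped.
chunks = [
    chunk_to_text(subtree)
    for subtree in chunked.subtrees(filter=lambda t: t.label() == "chunk")
]
print(chunks)
```

Joining with spaces keeps '80 %' in the same form the question shows, and only the currency symbol is reattached.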