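I am doing sentiment analysis on a given set of documents. My goal is to find the adjective closest to the target phrase in each sentence. I have an idea of how to extract the words surrounding the target phrase, but how do I find the closest adjective, NNP, VBN, or other POS tag relative to the target phrase?
Here is a rough sketch of how I might get the surrounding words with respect to my target phrase.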
# Lists rather than sets: sets are unordered, so sentences and phrases could not be paired up
sentence_List = ["Obviously one of the most important features of any computer is the human interface.",
                 "Good for everyday computing and web browsing.",
                 "My problem was with DELL Customer Service",
                 "I play a lot of casual games online[comma] and the touchpad is very responsive"]
target_phraseList = ["human interface", "everyday computing", "DELL Customer Service", "touchpad"]
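Note that my original dataset is a DataFrame containing the list of sentences and the corresponding target phrases. Here I am just simulating the data as follows: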
import pandas as pd
# Sentences are the values; the target phrases serve as the index
df = pd.Series(sentence_List, index=target_phraseList)
df = pd.DataFrame(df)
Here I tokenize the sentences as follows:
from nltk.tokenize import word_tokenize
tokenized_sents = [word_tokenize(i) for i in sentence_List]
# Shallow copy of the list of token lists
tokenized = list(tokenized_sents)
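Then I tried to use this (look at here) to find the words surrounding my target phrases. However, what I want is the adjective, verb, or VBN that is closest to my target phrase. How can I do this? Any ideas on how to get this done? Thanks.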
Answer 0: (score: 2)
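Would something like the following work for you? I recognize that some tweaks are needed to make it fully useful (checking for upper/lower case; also, if there is a tie, it returns the word that comes after the target phrase rather than the one before it), but hopefully it is useful enough to get you started: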
import nltk
from nltk.tokenize import MWETokenizer

def smart_tokenizer(sentence, target_phrase):
    """
    Tokenize a sentence using a full target phrase.
    """
    tokenizer = MWETokenizer()
    target_tuple = tuple(target_phrase.split())
    tokenizer.add_mwe(target_tuple)
    # nltk.pos_tag needs the NLTK POS tagger model to be downloaded
    token_sentence = nltk.pos_tag(tokenizer.tokenize(sentence.split()))
    # MWETokenizer joins the words of the phrase with its default separator,
    # an underscore, so identify what the phrase has been converted to
    temp_phrase = target_phrase.replace(' ', '_')
    target_index = [i for i, y in enumerate(token_sentence) if y[0] == temp_phrase]
    if len(target_index) == 0:
        return None, None
    else:
        return token_sentence, target_index[0]

def search(text_tag, tokenized_sentence, target_index):
    """
    Search for a part of speech (POS) nearest a target phrase of interest.
    """
    # Start at offset 1 so the target phrase itself is never returned
    for i in range(1, len(tokenized_sentence)):
        # Each entry is (word, POS): entry[0] is the word; entry[1] is the POS
        ahead = target_index + i
        behind = target_index - i
        # On a tie, the word after the target wins because it is checked first
        if ahead < len(tokenized_sentence) and tokenized_sentence[ahead][1] == text_tag:
            return tokenized_sentence[ahead][0]
        # Explicit bound check: a negative index would wrap around to the end
        if behind >= 0 and tokenized_sentence[behind][1] == text_tag:
            return tokenized_sentence[behind][0]
    # No token with the requested tag was found
    return None

x, i = smart_tokenizer(sentence='My problem was with DELL Customer Service',
                       target_phrase='DELL Customer Service')
print(search('NN', x, i))

y, j = smart_tokenizer(sentence="Good for everyday computing and web browsing.",
                       target_phrase="everyday computing")
print(search('NN', y, j))
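If you want to handle ties explicitly and match a whole family of tags (for example JJ, JJR and JJS together), one possible variation is sketched below. It works on any list of (word, tag) pairs, so it does not need NLTK to run. The name nearest_tagged, its prefer parameter, and the hand-written tags in the sample data are my own assumptions for illustration; a real tagger may label these words differently.

```python
def nearest_tagged(tagged, target_index, tag_prefix, prefer="ahead"):
    """
    Return (word, tag, signed_offset) for the token nearest to target_index
    whose POS tag starts with tag_prefix, or None if there is no such token.
    A positive offset means the word comes after the target phrase.
    """
    for distance in range(1, len(tagged)):
        ahead = target_index + distance
        behind = target_index - distance
        candidates = []
        if ahead < len(tagged) and tagged[ahead][1].startswith(tag_prefix):
            candidates.append((tagged[ahead][0], tagged[ahead][1], distance))
        if behind >= 0 and tagged[behind][1].startswith(tag_prefix):
            candidates.append((tagged[behind][0], tagged[behind][1], -distance))
        if candidates:
            # Both sides matched at the same distance: apply the tie rule
            if prefer == "behind":
                candidates.reverse()
            return candidates[0]
    return None


# Hand-written tags for illustration only (assumed, not tagger output)
tagged = [("the", "DT"), ("touchpad", "NN"), ("is", "VBZ"),
          ("very", "RB"), ("responsive", "JJ")]

print(nearest_tagged(tagged, 1, "JJ"))  # ('responsive', 'JJ', 3)
print(nearest_tagged(tagged, 1, "VB"))  # ('is', 'VBZ', 1)
```

Returning the signed offset along with the word lets you filter out matches that are too far from the target phrase to be meaningful.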
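Edit: I made some changes to address the issue of using a target phrase of arbitrary length, as you can see in the smart_tokenizer function. The key there is the nltk.tokenize.MWETokenizer class (for more information, see: Python: Tokenizing with phrases). Hopefully this helps. As an aside, I would challenge the idea that spaCy is necessarily more elegant: at some point, someone has to write the code to get the work done. That will either be the spaCy developers, or you when you roll your own solution. Their API is rather complicated, so I will leave that exercise to you.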
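On the upper/lower case caveat mentioned at the top: smart_tokenizer compares the phrase to the tokens exactly, so "Dell customer service" would not be found when the target phrase is "DELL Customer Service". Below is a minimal sketch of one way around that. Note that it is a hand-rolled stand-in written in plain Python, not MWETokenizer; the name merge_phrase and its separator parameter are hypothetical.

```python
def merge_phrase(tokens, phrase, separator="_"):
    """
    Merge every case-insensitive occurrence of `phrase` in `tokens`
    into a single token, keeping the casing found in the sentence.
    """
    phrase_tokens = [p.lower() for p in phrase.split()]
    n = len(phrase_tokens)
    merged = []
    i = 0
    while i < len(tokens):
        window = [t.lower() for t in tokens[i:i + n]]
        if n > 0 and window == phrase_tokens:
            merged.append(separator.join(tokens[i:i + n]))
            i += n
        else:
            merged.append(tokens[i])
            i += 1
    return merged


target_phrase = "DELL Customer Service"
tokens = "My problem was with Dell customer service".split()
merged = merge_phrase(tokens, target_phrase)
print(merged)  # ['My', 'problem', 'was', 'with', 'Dell_customer_service']

# Locate the merged phrase, again ignoring case
wanted = target_phrase.lower().replace(" ", "_")
target_index = next((i for i, t in enumerate(merged) if t.lower() == wanted), None)
print(target_index)  # 4
```

The merged list can then be passed to a POS tagger in place of the MWETokenizer output, and target_index used as the starting point for the search.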