I am currently using unigrams in my word2vec model, as follows.
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    # Returns a list of sentences, where each sentence is a list of words
    #
    # Use the NLTK tokenizer to split the paragraph into sentences
    raw_sentences = tokenizer.tokenize(review.strip())
    sentences = []
    for raw_sentence in raw_sentences:
        # If a sentence is empty, skip it
        if len(raw_sentence) > 0:
            # Otherwise, call review_to_wordlist to get a list of words
            sentences.append(review_to_wordlist(raw_sentence,
                                                remove_stopwords))
    #
    # Return the list of sentences (each sentence is a list of words,
    # so this returns a list of lists)
    return sentences
However, this way I miss important bigrams and trigrams in my dataset.
E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"
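So I want to capture the important bigrams, trigrams, and so on in my dataset and feed them into my word2vec model.
I am new to word2vec and struggling to work out how to do this. Please help.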
Answer 0 (score: 13)
First, you should use gensim's Phrases class to get bigrams, as shown in the documentation:
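>>> from gensim.models.phrases import Phrases, Phraser
>>> # sentence_stream: your tokenized training sentences (a list of token lists)
>>> phrases = Phrases(sentence_stream)
>>> bigram = Phraser(phrases)
>>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
>>> print(bigram[sent])
[u'the', u'mayor', u'of', u'new_york', u'was', u'there']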
To get trigrams and longer phrases, take the bigram model you already have and apply Phrases again to its output, and so on. For example:
trigram_model = Phrases(bigram_sentences)
There is also a good notebook and a video that explain how to use it: the notebook, the video.
The most important part is how to apply it to real sentences, as follows:
from gensim.models import Phrases

# train the bigram model on the tokenized (unigram) sentences
bigram_model = Phrases(unigram_sentences)
# apply the trained model to each sentence, keeping the result as token lists
bigram_sentences = []
for unigram_sentence in unigram_sentences:
    bigram_sentences.append(bigram_model[unigram_sentence])
# train a trigram model on the bigram-transformed sentences
trigram_model = Phrases(bigram_sentences)
Hope this helps, but next time please give us more information about what you are using, etc.
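P.S.: Now that you have edited the question: nothing in your code keeps bigrams from being split. You have to use Phrases to get terms like "New York" as bigrams.
Answer 1 (score: 6)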
from gensim.models import Phrases
from gensim.models.phrases import Phraser

documents = ["the mayor of new york was there",
             "machine learning can be useful sometimes",
             "new york mayor was present"]

sentence_stream = [doc.split(" ") for doc in documents]
print(sentence_stream)

# bytes delimiter as in gensim 3.x; gensim 4.x expects a str here (delimiter=' ')
bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
print(bigram_phraser)

# apply the phraser to each tokenized sentence
for sent in sentence_stream:
    tokens_ = bigram_phraser[sent]
    print(tokens_)
Answer 2 (score: 0)
Phrases and Phraser are what you should be looking for:
bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
Once you are done adding vocabulary, you can use Phraser for faster access and more efficient memory usage. It is not mandatory, but it is useful.
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
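To see why a higher threshold gives fewer phrases, here is a pure-Python sketch of the kind of score a candidate pair is compared against. It assumes the default scorer follows the formula from the original word2vec phrase paper, (bigram_count - min_count) / (count_a * count_b) * vocab_size, with the vocabulary size counting unigrams and bigrams together. The function `phrase_scores` is a hypothetical helper, not part of gensim, and gensim's own bookkeeping may differ in details.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score every adjacent word pair with the Mikolov-style formula."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    # Assumption: vocabulary size counts unigrams and bigrams together.
    vocab_size = len(unigrams) + len(bigrams)
    return {
        pair: (count - min_count) / (unigrams[pair[0]] * unigrams[pair[1]]) * vocab_size
        for pair, count in bigrams.items()
    }

sentences = [
    "the mayor of new york was there".split(),
    "machine learning can be useful sometimes".split(),
    "new york mayor was present".split(),
]
scores = phrase_scores(sentences, min_count=1)

# A pair becomes a phrase only if its score exceeds the threshold.
for threshold in (2, 10):
    kept = sorted(pair for pair, score in scores.items() if score > threshold)
    print(threshold, kept)
# 2 [('new', 'york')]
# 10 []
```

With `min_count=1`, any pair seen only once scores 0, so on this corpus only "new york" (seen twice, score 7.0) can pass, and it does so at threshold 2 but not at 10.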
Thanks