How to tag all strings in a numpy array of strings

Asked: 2016-12-12 11:29:04

Tags: python numpy scikit-learn nltk

I am trying to find the positions of words from a specific list of strings within a list of sentences, using numpy, sklearn and nltk. In my real code I have 10,000 sentences and a word list of similar length, so I am trying to avoid plain Python loops and lists/sets, because they are not fast enough.

So far I have written the following code:

from nltk.tokenize import TweetTokenizer
import numpy as np
from sklearn import feature_extraction

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]

posWords = ["great", "like", "amazing", "little lamb"]

tknzr = TweetTokenizer()

# Here we see which words from the wordlist appear in the sentences.
cv = feature_extraction.text.CountVectorizer(vocabulary=posWords)
taggedSentences = cv.fit_transform(sentences).toarray()  # shape: (noOfSentences, noOfWordsInPosWords)

taggedSentencesCutDown = taggedSentences > 0
# Array of (sentence, wordIndex) pairs, one row per vocabulary hit.
taggedSentencesCutDown = np.column_stack(np.where(taggedSentencesCutDown))


sentencesIdentified = np.unique(taggedSentencesCutDown[:, 0])


for sentenceIdx in sentencesIdentified:

    # Lowercase the tokens so they compare equal to CountVectorizer's
    # (lowercased) vocabulary entries.
    tokenisedSent = np.array([t.lower() for t in tknzr.tokenize(sentences[sentenceIdx])])
    wordsFoundSent = np.where(taggedSentencesCutDown[:, 0] == sentenceIdx)
    wordsFoundSent = taggedSentencesCutDown[wordsFoundSent]

    # np.isin gives an element-wise membership test; a bare `in` on a
    # numpy array does not do what is wanted here.
    matches = np.where(np.isin(tokenisedSent, np.array(posWords)[wordsFoundSent[:, 1]]))
    sent = tokenisedSent[matches]

Ideally, what I would like is the following array:

[[0, 0, 0], [0, 2, 4], [1, 1, 1], [2, 3, 3]]

# where each triplet represents [sentenceNumber, wordNo, Position in sentence] 
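For reference, the desired triplets for the single-token entries can be reproduced with a plain (unvectorized) loop. This is only a sketch to pin down the expected output, not the fast vectorized version being asked for, and it deliberately shows that the multi-word entry "little lamb" is missed by token-level matching:

```python
from nltk.tokenize import TweetTokenizer
import numpy as np

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]
posWords = ["great", "like", "amazing", "little lamb"]

tknzr = TweetTokenizer()

# Build [sentenceNumber, wordNo, positionInSentence] triplets by brute force.
triplets = []
for sentIdx, sentence in enumerate(sentences):
    # Lowercase tokens so they match the lowercase word list.
    tokens = np.array([t.lower() for t in tknzr.tokenize(sentence)])
    for wordIdx, word in enumerate(posWords):
        for pos in np.where(tokens == word)[0]:
            triplets.append([sentIdx, wordIdx, int(pos)])

print(triplets)  # → [[0, 0, 0], [0, 2, 4], [1, 1, 1]]
```

Note that `[2, 3, 3]` is absent: "little lamb" spans two tokens, which is exactly the CountVectorizer issue raised in point 2.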

There are two things I need here:

  1. Tokenize all the sentences referenced in the taggedSentencesCutDown array with the NLTK tokenizer, preferably without the for loop I am using now, since my real arrays have 10,000 sentences and words.

  2. Can CountVectorizer handle a string like "little lamb"? At the moment it is not being caught. Is there a way to do this as efficiently and elegantly as CountVectorizer does?

Thanks in advance.
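On point 2, one possible approach (a sketch, assuming the standard scikit-learn API rather than anything from the question's own code) is to widen `ngram_range` so the analyzer also emits bigrams, which lets the two-word vocabulary entry "little lamb" match:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Great place and so amazing", "I like doughnuts", "Mary had a little lamb"]
posWords = ["great", "like", "amazing", "little lamb"]

# ngram_range=(1, 2) makes the analyzer produce unigrams AND bigrams;
# only the fixed vocabulary entries are counted, so the output columns
# still line up with posWords.
cv = CountVectorizer(vocabulary=posWords, ngram_range=(1, 2))
counts = cv.fit_transform(sentences).toarray()

print(counts)
# → [[1 0 1 0]
#    [0 1 0 0]
#    [0 0 0 1]]
```

The last row shows "little lamb" being caught in the third sentence, which the unigram-only vectorizer misses.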

0 Answers:

No answers yet.