Question

I have a numpy array of sentences (strings):

import numpy as np

# Double quotes are needed because two of the sentences contain an apostrophe
arr = np.array(["It's the most wonderful time of the year.",
                "With the kids jingle belling.",
                "And everyone telling you be of good cheer.",
                "It's the hap-happiest season of all."])

(I read them from a CSV file.) I need to build a numpy array of all the unique words in these sentences.

So what I need is:

array(["It's", "the", "most", "wonderful", "time", "of", "year", "With", "kids", "jingle", "belling", "And", "everyone", "telling", "you", "be", "good", "cheer", "hap-happiest", "season", "all"])

I can do it like this:

o = []
for x in arr:
    o += x.split()
words = np.array(o)
unique_words = np.array(list(set(words.tolist())))

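But since this first builds a list and then converts it to a numpy array, it will obviously be slow and inefficient for large data.

I also tried nltk, like this: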

words = np.array([])
for x in arr:
    words = np.append(words, nltk.word_tokenize(x))

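But this also seems inefficient, because np.append creates a new array on every iteration instead of modifying the existing one.

I suspect there is a more elegant way to get what I want that makes better use of numpy.

Can you point me in the right direction?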

Answer 1

I think you could try something like this:

vocab = set()
for x in arr:
    vocab.update(nltk.word_tokenize(x))

set.update() adds the elements of an iterable to an existing set.
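As a rough sketch of the same idea without nltk, the loop below uses plain str.split and strips leading and trailing punctuation from each token. The punctuation handling is an assumption made here to match the output the question asks for; it is not what nltk.word_tokenize does.

```python
import string

sentences = [
    "It's the most wonderful time of the year.",
    "With the kids jingle belling.",
    "And everyone telling you be of good cheer.",
    "It's the hap-happiest season of all.",
]

vocab = set()
for sentence in sentences:
    # Split on whitespace, then strip punctuation from both ends of each token
    # (assumption: inner punctuation such as "It's" or "hap-happiest" is kept)
    tokens = (tok.strip(string.punctuation) for tok in sentence.split())
    # update() adds every element of the iterable; duplicates are ignored
    vocab.update(tok for tok in tokens if tok)

unique_words = sorted(vocab)
```

If a numpy array is needed at the end, np.array(unique_words) converts the result once, after all the sentences have been processed.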

Update:

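Also, you can look at how CountVectorizer in scikit-learn works:

"Convert a collection of text documents to a matrix of token counts."

It uses a dictionary to keep track of the unique words: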

    # raw_documents is an iterable of sentences.
    for doc in raw_documents:
        feature_counter = {}

        # analyze will split the sentences into tokens 
        # and apply some preprocessing on them (like stemming, lemma etc)
        for feature in analyze(doc):
            try:
                # vocabulary is a dictionary mapping each word to its feature index
                feature_idx = vocabulary[feature]
                ...
                ...

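I think it is quite efficient. So I think you could also use a dict() instead of a set. I am not familiar with how NLTK works, but I assume it also contains something equivalent to CountVectorizer.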
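A minimal sketch of the dict-based idea, using only the standard library, might look like this. It is not scikit-learn's implementation: the names vocabulary and counts are chosen here for illustration, and tokens come from a plain whitespace split, so trailing punctuation stays attached.

```python
from collections import Counter

sentences = [
    "It's the most wonderful time of the year.",
    "With the kids jingle belling.",
    "And everyone telling you be of good cheer.",
    "It's the hap-happiest season of all.",
]

vocabulary = {}     # word -> index, in order of first appearance
counts = Counter()  # word -> number of occurrences across all sentences

for sentence in sentences:
    for word in sentence.split():
        # setdefault only stores the new index if the word is unseen
        vocabulary.setdefault(word, len(vocabulary))
        counts[word] += 1

# The dict keys are the unique words
unique_words = list(vocabulary)
```

Compared with a set, the dict keeps the order in which words were first seen and gives every word a stable integer index, which is handy if the words later become the columns of a count matrix.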

Answer 2

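I am not sure numpy is the best way to go here. You can use nested lists together with sets or dictionaries to get what you want.

One useful thing to know is that nltk's tokenizers can process a list of sentences and return a list of tokenized sentences. For example: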

from nltk.tokenize import WordPunctTokenizer
wpt = WordPunctTokenizer()

tokenized = wpt.tokenize_sents(arr)

This returns a list of the tokenized sentences in arr, i.e.:

[['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
 ['With', 'the', 'kids', 'jingle', 'belling', '.'],
 ['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
 ['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.']]

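nltk ships with many different tokenizers, so you can pick the one that splits your sentences into word tokens the way you want. You can then use the following to get the set of unique words/tokens: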

unique_words = set()
for toks in tokenized:
    unique_words.update(toks)
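The same loop can also be written in one step with itertools.chain.from_iterable, which flattens the nested list lazily. In the sketch below, tokenized is hard-coded to the output shown above so that the example runs without nltk, and the str.isalnum filter for dropping punctuation-only tokens is an extra step assumed here, not something the tokenizer does.

```python
from itertools import chain

# Output of wpt.tokenize_sents(arr), copied from above
tokenized = [
    ['It', "'", 's', 'the', 'most', 'wonderful', 'time', 'of', 'the', 'year', '.'],
    ['With', 'the', 'kids', 'jingle', 'belling', '.'],
    ['And', 'everyone', 'telling', 'you', 'be', 'of', 'good', 'cheer', '.'],
    ['It', "'", 's', 'the', 'hap', '-', 'happiest', 'season', 'of', 'all', '.'],
]

# Flatten the list of lists and deduplicate in a single pass
unique_words = set(chain.from_iterable(tokenized))

# Optionally drop tokens that consist only of punctuation
words_only = sorted(tok for tok in unique_words if tok.isalnum())
```

Note that this tokenizer splits "It's" into three tokens and "hap-happiest" into "hap", "-" and "happiest", so the result differs from the array the question asks for; a different tokenizer would be needed to keep those words whole.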
