Question

I have a function that cleans text by removing the stop words in a given set:

import re

def clean_text(raw_text, stopwords_set):
    # remove everything that is not a letter
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    # lower case + split --> list of words
    words = letters_only.lower().split()
    # now remove the stop words
    meaningful_words = [w for w in words if w not in stopwords_set]
    # join the remaining words together to get the cleaned tweet
    return " ".join(meaningful_words)

I have a dataset of 1.6 million Twitter tweets in a pandas DataFrame. If I simply apply this function to the DataFrame like this:

dataframe['clean_text'] = dataframe.apply(
    lambda text: clean_text(text, set(stopwords.words('english'))),
    axis = 1)

the computation takes about 2 minutes to finish. However, when I use np.vectorize like this:

dataframe['clean_text'] = np.vectorize(clean_text)(
    dataframe['text'], set(stopwords.words('english')))

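the computation finishes after about 10 seconds.

That would not be surprising in itself, except that both methods use only one core on my machine. I had assumed that vectorize would automatically use multiple cores to finish sooner, and that this would explain the speedup, but it seems to do something different.

What kind of "magic" does numpy's vectorize do?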

Answer 1

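I wondered how vectorize would handle these inputs. It is designed to take array inputs, broadcast them against each other, and feed all the elements, as scalars, to your function. In particular I wondered how it would handle the set.

With your function and a print(stop_words) added, I get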

In [98]: words = set('one two three four five'.split())
In [99]: f=np.vectorize(clean_text)
In [100]: f(['this is one line with two words'],words)
{'five', 'four', 'three', 'one', 'two'}
{'five', 'four', 'three', 'one', 'two'}
Out[100]: 
array(['this is line with words'], 
      dtype='<U23')

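The set is displayed twice because vectorize runs a test call to determine the dtype of the returned array. But contrary to what I feared, it passes the whole set to the function. That is because wrapping a set in an array just creates a 0d object array: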

In [101]: np.array(words)
Out[101]: array({'five', 'four', 'three', 'one', 'two'}, dtype=object)
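
Both observations can be checked directly. The following is a minimal sketch rather than a look at numpy's internals; word_count is a hypothetical stand-in for clean_text that returns an integer, so that otypes=[int] can be used to skip the dtype-detection call.

```python
import numpy as np

words = set('one two three four five'.split())

# Wrapping a set in an array gives a 0-d object array holding the whole set.
arr = np.array(words)
arr_shape, arr_dtype = arr.shape, arr.dtype

calls = []

def word_count(text, stopwords_set):
    # Hypothetical helper: count the words that are not stop words,
    # recording every call so the calls can be counted afterwards.
    calls.append(text)
    return len([w for w in text.split() if w not in stopwords_set])

texts = ['one fish', 'two fish', 'red fish']

# Without otypes, vectorize makes one extra call on the first element
# to work out the output dtype.
counts_inferred = np.vectorize(word_count)(texts, words)
n_calls_inferred = len(calls)

# With otypes given, the extra call is not needed.
calls.clear()
counts_declared = np.vectorize(word_count, otypes=[int])(texts, words)
n_calls_declared = len(calls)

print(arr_shape, arr_dtype, n_calls_inferred, n_calls_declared)
```

The 0d array broadcasts against the text array, so every call receives the same set object.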

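Since we don't want the vectorized function to iterate over the second argument, we really should use the excluded parameter. The speed difference is probably negligible.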

In [104]: f=np.vectorize(clean_text, excluded=[1])
In [105]: f(['this is one line with two words'],words)
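
As a self-contained sketch of the same check (the variable names f_plain, f_excluded, out_plain and out_excluded are only for illustration), both forms should return the same cleaned text for this input:

```python
import re
import numpy as np

def clean_text(raw_text, stopwords_set):
    # Same function as in the question.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in stopwords_set]
    return " ".join(meaningful_words)

words = set('one two three four five'.split())
text = ['this is one line with two words']

# Positional argument 1 (the stop word set) is passed through untouched.
f_excluded = np.vectorize(clean_text, excluded=[1])
# Here the set is wrapped in a 0-d object array and broadcast instead.
f_plain = np.vectorize(clean_text)

out_excluded = f_excluded(text, words)
out_plain = f_plain(text, words)
print(out_excluded, out_plain)
```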

But with only one array or data series to iterate over, vectorize is little more than a 1d iteration or list comprehension:

In [111]: text = ['this is one line with two words']
In [112]: [clean_text(t, words) for t in text]
Out[112]: ['this is line with words']

If I make the text list much longer (10000 items):

In [121]: timeit [clean_text(t, words) for t in text]
10 loops, best of 3: 98.2 ms per loop
In [122]: f=np.vectorize(clean_text, excluded=[1])
In [123]: timeit f(text,words)
10 loops, best of 3: 158 ms per loop
In [124]: f=np.vectorize(clean_text)
In [125]: timeit f(text,words)
10 loops, best of 3: 108 ms per loop
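
Outside IPython, a comparison along these lines can be run with the standard timeit module. This is only a sketch: the repeat count of 3 is an arbitrary choice, and the absolute numbers will vary by machine and numpy version.

```python
import re
import timeit
import numpy as np

def clean_text(raw_text, stopwords_set):
    # Same function as in the question.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in stopwords_set]
    return " ".join(meaningful_words)

words = set('one two three four five'.split())
text = ['this is one line with two words'] * 10000

f_plain = np.vectorize(clean_text)
f_excluded = np.vectorize(clean_text, excluded=[1])

# Reference result, used to confirm the variants agree.
expected = [clean_text(t, words) for t in text]

candidates = {
    'list comprehension': lambda: [clean_text(t, words) for t in text],
    'vectorize': lambda: f_plain(text, words),
    'vectorize, excluded=[1]': lambda: f_excluded(text, words),
}

for name, fn in candidates.items():
    # Total seconds for 3 runs of each variant.
    seconds = timeit.timeit(fn, number=3)
    print(f'{name}: {seconds / 3 * 1000:.1f} ms per loop')
```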

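excluded actually slows vectorize down; without it, the list comprehension and vectorize perform about the same.

So if pandas apply is much slower, it's not because vectorize is magical. It's because apply is slow.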
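
If that is the case, a plain list comprehension over the text column is one alternative to a row-wise apply. The sketch below assumes pandas is installed and uses a small hypothetical stop word set in place of stopwords.words('english'). Note also that the apply call in the question rebuilds the stop word set inside the lambda for every row, whereas the np.vectorize call builds it once; here it is built once.

```python
import re
import pandas as pd

def clean_text(raw_text, stopwords_set):
    # Same function as in the question.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    meaningful_words = [w for w in words if w not in stopwords_set]
    return " ".join(meaningful_words)

# Hypothetical stand-in for set(stopwords.words('english')), built once.
stop_words = {'is', 'with', 'the'}

dataframe = pd.DataFrame(
    {'text': ['This is one line', 'Another line with words']}
)

# Iterate over the single column directly instead of over whole rows.
dataframe['clean_text'] = [clean_text(t, stop_words) for t in dataframe['text']]
print(dataframe)
```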
