Question

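I have a function that accepts a list of strings. It does some processing on that list and returns another list of strings, possibly shorter.

Now I have a numpy array of input lists of strings, and I want to apply this transformation function to each list in the array.

From the searching I have done so far, vectorize or apply_along_axis look like good candidates, but neither works as expected.

I would like to do this as efficiently as possible. The final input array will contain around 100K lists.

I suppose I could iterate over the numpy array in a for loop and append each output list, one at a time, to a new output array, but that seems horribly inefficient.

Here is what I have tried. For testing purposes, I made a dumbed-down transformation function, and the input array contains only 3 lists.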

import numpy as np

def my_func(l):
    # accepts list, returns another list
    # dumbed down list transformation function
    # for testing, just return the first 2 elems of original list
    return l[0:2]

# dtype=object is required on NumPy 1.24+, which no longer builds ragged object arrays implicitly
test_arr = np.array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']], dtype=object)

np.apply_along_axis(my_func, 0, test_arr)
Out[51]: array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum']], dtype=object)

# Rather than applying item by item, this returns the first 2 elements of the entire outer array!!

# Expected:
# array([['the', 'quick'], ['lorem', 'ipsum'], ['this', 'is']])

# Attempt 2...

my_func_vec = np.vectorize(my_func)
my_func_vec(test_arr)

Result:

Traceback (most recent call last):

  File "<ipython-input-56-f9bbacee645c>", line 1, in <module>
    my_func_vec(test_arr)

  File "C:\Users\Tony\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2218, in __call__
    return self._vectorize_call(func=func, args=vargs)

  File "C:\Users\Tony\Anaconda2\lib\site-packages\numpy\lib\function_base.py", line 2291, in _vectorize_call
    copy=False, subok=True, dtype=otypes[0])

ValueError: cannot set an array element with a sequence

Answer 1

Read the description of the optional otypes parameter in the np.vectorize docstring:

otypes : str or list of dtypes, optional
    The output data type. It must be specified as either a string of typecode characters or a list of data type specifiers. There should be one data type specifier for each output.

So you can write:

my_func_vec = np.vectorize(my_func, otypes=[list])

It lets you create structured arrays with complex outputs, but it also solves your problem of having lists as array elements.
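As a minimal sketch of this suggestion, the snippet below builds the test array from the question and applies the vectorized function to it. It passes otypes=[object] (the spelling used in Answer 2) rather than otypes=[list], and it fills the input array element by element; both choices are assumptions made here to keep the example independent of the NumPy version.

```python
import numpy as np

def my_func(l):
    # same toy transformation as in the question: keep the first 2 elements
    return l[0:2]

lists = [['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']]

# Build a 1-D object array whose elements are Python lists.
test_arr = np.empty(len(lists), dtype=object)
for i, item in enumerate(lists):
    test_arr[i] = item

# otypes tells vectorize the output dtype up front, so it does not
# try to infer it from a trial call and fail on a sequence result.
my_func_vec = np.vectorize(my_func, otypes=[object])
result = my_func_vec(test_arr)

print(result.shape)  # one output list per input list
print(result.tolist())
```

The result is again a 1-D object array, so each element is an ordinary Python list.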


Answer 2

Some comparisons and time tests; but keep in mind that this is a small example.

In [106]: test_arr = np.array([['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']], dtype=object)
     ...: 
In [107]: def my_func(l):
     ...:     # accepts list, returns another list
     ...:     # dumbed down list transformation function
     ...:     # for testing, just return the first 2 elems of original list
     ...:     return l[0:2]
     ...:

The list comprehension approach returns a 2-D array of strings, because the function returns a 2-element list every time.

In [108]: np.array([my_func(x) for x in test_arr])
Out[108]: 
array([['the', 'quick'],
       ['lorem', 'ipsum'],
       ['this', 'is']],
      dtype='<U5')

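The input array has object dtype because the sublists differ in length (on NumPy 1.24 and later, dtype=object must be passed explicitly for such ragged input):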

In [109]: test_arr
Out[109]: 
array([list(['the', 'quick', 'brown', 'fox']), list(['lorem', 'ipsum']),
       list(['this', 'is', 'a', 'test'])], dtype=object)

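frompyfunc returns an object dtype array; consistent with my past tests, it is modestly faster than the list comprehension (up to about 2x, but never an order of magnitude):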

In [110]: np.frompyfunc(my_func,1,1)(test_arr)
Out[110]: 
array([list(['the', 'quick']), list(['lorem', 'ipsum']),
       list(['this', 'is'])], dtype=object)

In [111]: timeit np.frompyfunc(my_func,1,1)(test_arr)
5.68 µs ± 230 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [112]: timeit np.array([my_func(x) for x in test_arr])
8.96 µs ± 25.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

vectorize uses frompyfunc but adds more overhead. otypes is needed to avoid the sequence error (otherwise it tries to infer the return type from a trial calculation):

In [113]: np.vectorize(my_func,otypes=[object])(test_arr)
Out[113]: 
array([list(['the', 'quick']), list(['lorem', 'ipsum']),
       list(['this', 'is'])], dtype=object)
In [114]: timeit np.vectorize(my_func,otypes=[object])(test_arr)
30.4 µs ± 132 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
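The 2-D string result above depends on every output list having the same length. As a rough sketch of the case where the outputs differ in length, the snippet below uses frompyfunc, which returns an object array regardless. The function keep_long_words is a hypothetical stand-in for the real transformation and is not part of the question.

```python
import numpy as np

def keep_long_words(words):
    # hypothetical transformation: keep only words longer than 4 characters,
    # so different inputs can yield outputs of different lengths
    return [w for w in words if len(w) > 4]

lists = [['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']]

# 1-D object array of Python lists, filled element by element
test_arr = np.empty(len(lists), dtype=object)
for i, item in enumerate(lists):
    test_arr[i] = item

# frompyfunc(func, nin, nout) wraps a Python function as a ufunc
# that always returns object dtype arrays.
ragged = np.frompyfunc(keep_long_words, 1, 1)(test_arr)

print(ragged.dtype, ragged.shape)
print(ragged.tolist())
```

Because the elements stay Python lists, no padding or common length is required.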

Answer 3

[my_func(x) for x in test_arr]

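You need to go down one level. Your solution returns only the first 2 items of the array, rather than the first 2 items of each item in the array.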
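To make the difference concrete, here is a small sketch that records what my_func actually receives in each case. The received list and the call counters are additions for illustration only.

```python
import numpy as np

lists = [['the', 'quick', 'brown', 'fox'], ['lorem', 'ipsum'], ['this', 'is', 'a', 'test']]

# 1-D object array of Python lists, filled element by element
test_arr = np.empty(len(lists), dtype=object)
for i, item in enumerate(lists):
    test_arr[i] = item

received = []  # every argument my_func is called with

def my_func(l):
    received.append(l)
    return l[0:2]

# On a 1-D array, axis 0 is the only axis, so the function is called
# once with the whole array and slices off its first 2 elements.
whole = np.apply_along_axis(my_func, 0, test_arr)
n_calls_axis = len(received)

# Iterating goes down one level: the function sees each inner list.
received.clear()
per_item = [my_func(x) for x in test_arr]
n_calls_loop = len(received)

print(n_calls_axis, len(whole))
print(n_calls_loop, per_item)
```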

Applying a function to each list in a numpy array of lists
