Question

This is the question I was trying to answer.

Suppose I have a Pandas df
      col_name
1    [16, 4, 30]   
2    [5, 1, 2]   
3    [4, 5, 52, 888]
4    [1, 2, 4]
5    [5, 99, 4, 75, 1, 2]
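I want to remove every element that appears fewer than x times across the entire column; for example, let's take x = 3.

That means I want the result to look like this: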
      col_name
1    [4]   
2    [5, 1, 2]   
3    [4, 5]
4    [1, 2, 4]
5    [5, 4, 1, 2]

For convenience, here is the data.

import pandas as pd

d = {'col_name': {1: [16, 4, 30],
      2: [5, 1, 2],
      3: [4, 5, 52, 888],
      4: [1, 2, 4],
      5: [5, 99, 4, 75, 1, 2]}}

df = pd.DataFrame(d)

The current approach:

import numpy as np
from collections import Counter
c = Counter(pd.Series(np.concatenate(df.col_name.tolist())))

def foo(array):
    return [x  for x in array if c[x] >= 3]

df.col_name = df.col_name.apply(foo)
df

       col_name
1           [4]
2     [5, 1, 2]
3        [4, 5]
4     [1, 2, 4]
5  [5, 4, 1, 2]

This works, but it is slow. So I wanted to use np.vectorize to speed it up:

v  = np.vectorize(foo)
df.col_name = v(df.col_name)   # <---- error thrown here

and I get this error:

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/numpy/lib/function_base.py in _vectorize_call(self, func, args)
   2811 
   2812             if ufunc.nout == 1:
-> 2813                 res = array(outputs, copy=False, subok=True, dtype=otypes[0])
   2814             else:
   2815                 res = tuple([array(x, copy=False, subok=True, dtype=t)

ValueError: setting an array element with a sequence.

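It seems I misunderstand how np.vectorize works. What am I doing wrong, and how can I get this solution to work with np.vectorize, if at all?

To clarify, I'm not looking for a workaround, just help understanding why I am getting this error.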

Answer 1

With your dataframe and function:

In [70]: df
Out[70]: 
               col_name
1           [16, 4, 30]
2             [5, 1, 2]
3       [4, 5, 52, 888]
4             [1, 2, 4]
5  [5, 99, 4, 75, 1, 2]

In [71]: df.values     # values is an object array
Out[71]: 
array([[list([16, 4, 30])],
       [list([5, 1, 2])],
       [list([4, 5, 52, 888])],
       [list([1, 2, 4])],
       [list([5, 99, 4, 75, 1, 2])]], dtype=object)

Using apply, but returning a Series rather than modifying df:

In [73]: df.col_name.apply(foo)
Out[73]: 
1             [4]
2       [5, 1, 2]
3          [4, 5]
4       [1, 2, 4]
5    [5, 4, 1, 2]
Name: col_name, dtype: object
In [74]: timeit df.col_name.apply(foo)
214 µs ± 912 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For comparison, apply foo to the original dictionary, d:

In [76]: {i:foo(d['col_name'][i]) for i in range(1,6)}
Out[76]: {1: [4], 2: [5, 1, 2], 3: [4, 5], 4: [1, 2, 4], 5: [5, 4, 1, 2]}
In [77]: timeit {i:foo(d['col_name'][i]) for i in range(1,6)}
18.3 µs ± 39.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Note that this is faster than just extracting the list from the dataframe:

In [84]: timeit df.col_name.tolist()
25.3 µs ± 92 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

foo applied to a list rather than a dictionary takes about the same time:

In [85]: dlist=df.col_name.tolist()
In [86]: timeit [foo(x) for x in dlist]
16.6 µs ± 27.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Defining an object-dtype vectorized function:

In [87]: f = np.vectorize(foo, otypes=[object])
In [88]: f(dlist)
Out[88]: 
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
       list([5, 4, 1, 2])], dtype=object)
In [89]: timeit f(dlist)
36.7 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

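This is slower than the direct iteration. Preconverting the list to an object array (darr=np.array(dlist)) saves only a µs or two. (On NumPy 1.24 and later, ragged lists like these need np.array(dlist, dtype=object).)

Since we are returning an object array, we might as well use frompyfunc (which vectorize uses):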

In [94]: ff = np.frompyfunc(foo, 1,1)
In [95]: ff(darr)
Out[95]: 
array([list([4]), list([5, 1, 2]), list([4, 5]), list([1, 2, 4]),
       list([5, 4, 1, 2])], dtype=object)
In [96]: timeit ff(darr)
18 µs ± 6.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

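I've tested cases where frompyfunc is 2x faster than direct iteration. That might be the case here with a much larger test array.

Among numpy users, np.vectorize has a reputation for being slow and often tricky to use (especially if otypes is omitted). Its apparent speed here is relative to pandas apply, which appears to have a lot of overhead compared to array applications.

Given pandas' propensity for using object dtype arrays, frompyfunc might be a better tool than np.vectorize.

As to why the plain vectorize raises the error, I suspect it has to do with how it chooses the implied otypes.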
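As a rough, self-contained sketch of the frompyfunc route (not a benchmark): the names rows, counts and keep_frequent below are illustrative stand-ins for the question's data and foo, and the object array is filled element by element so that NumPy never tries to build a 2-D array from the lists.

```python
from collections import Counter

import numpy as np

# Same lists as in the question (hypothetical variable name).
rows = [[16, 4, 30], [5, 1, 2], [4, 5, 52, 888], [1, 2, 4], [5, 99, 4, 75, 1, 2]]

# Count every value across all rows.
counts = Counter(x for row in rows for x in row)


def keep_frequent(row, min_count=3):
    """Stand-in for foo: keep values that occur at least min_count times overall."""
    return [x for x in row if counts[x] >= min_count]


# Build a 1-D object array explicitly, one list per slot.
darr = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    darr[i] = row

# frompyfunc(func, nin, nout) always returns object-dtype results,
# so there is no output dtype to guess.
ff = np.frompyfunc(keep_frequent, 1, 1)
result = ff(darr)

print(result.tolist())
```

Because the result is already an object array of lists, it can be assigned straight back to a DataFrame column of the same length.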

In [106]: f1 = np.vectorize(foo)
In [107]: f(darr[[0,0,0]])
Out[107]: array([list([4]), list([4]), list([4])], dtype=object)
In [108]: f1(darr[[0,0,0]])
...
ValueError: setting an array element with a sequence.

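We'd have to dig into the vectorize code, but I suspect it deduces from the first [4] result that the return type should be an integer. But the actual calls return a list, and even a 1-element list won't fit in an integer slot.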
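The suspicion above can be checked with a small stand-alone sketch. The function small_values is a hypothetical stand-in for foo, not the original code; the point is only that a list result looks like an integer array when NumPy inspects it, which leads to the same ValueError unless otypes is given.

```python
import numpy as np


def small_values(seq):
    """Hypothetical stand-in for foo: returns a list of varying length."""
    return [x for x in seq if x < 10]


# 1-D object array whose elements are Python lists.
darr = np.empty(2, dtype=object)
darr[0] = [16, 4, 30]
darr[1] = [5, 1, 2]

# What dtype does the first result look like? [4] converts to an integer array.
inferred_kind = np.asarray(small_values(darr[0])).dtype.kind

# Without otypes, vectorize tries to store each list in an integer slot.
plain = np.vectorize(small_values)
try:
    plain(darr)
    raised = False
except ValueError:
    raised = True

# With an explicit object otype, each list is stored as-is.
typed = np.vectorize(small_values, otypes=[object])
fixed = typed(darr)

print(inferred_kind, raised, fixed.tolist())
```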

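Testing the _get_ufunc_and_otypes method that vectorize uses to determine otypes:

In [126]: f1._get_ufunc_and_otypes(foo,[darr])
Out[126]: (<ufunc '? (vectorized)'>, 'l')

_get_ufunc_and_otypes calculates outputs from the first element of the input array, and then does:

if isinstance(outputs, tuple):
    nout = len(outputs)
else:
    nout = 1
    outputs = (outputs,)
otypes = ''.join([asarray(outputs[_k]).dtype.char
                  for _k in range(nout)])

In your case outputs is [4], a list, so it sets nout to 1 and deduces otypes from the first result. The same thing happens if [5,1,2] is first.

This automatic otypes most often bites users when they want a float result but the first value returns an integer, e.g. 0. Then they get unexpected truncation.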
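The truncation pitfall can be reproduced in a few lines. This is a minimal sketch with a made-up function, half_or_zero, whose first call happens to return the integer 0; it assumes the dtype inference behaves as described in this answer.

```python
import numpy as np


def half_or_zero(x):
    """Made-up example: returns the int 0 for non-positive input, a float otherwise."""
    return 0 if x <= 0 else x / 2


xs = np.array([0, 1, 3])

# otypes omitted: the dtype is taken from the first result (the int 0),
# so the later float results are cast to integers.
inferred = np.vectorize(half_or_zero)(xs)

# otypes given: the results are stored as floats.
explicit = np.vectorize(half_or_zero, otypes=[float])(xs)

print(inferred, explicit)
```

Passing otypes explicitly removes the dependence on whichever element happens to come first.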

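That method has a test for the type of outputs. Let's test that.

First, a version of foo that returns a tuple rather than a list:

In [162]: foot = lambda x: tuple(foo(x))
In [163]: [foot(x) for x in darr]
Out[163]: [(4,), (5, 1, 2), (4, 5), (1, 2, 4), (5, 4, 1, 2)]
In [164]: ft = np.vectorize(foot)

The same error occurs when it is applied to the whole darr:

In [165]: ft(darr)
...
ValueError: setting an array element with a sequence.

But when applied to a subset of darr whose elements all return 3 items, I get a tuple of arrays:

In [167]: ft(darr[[1,3,1,3]])
Out[167]: (array([5, 1, 5, 1]), array([1, 2, 1, 2]), array([2, 4, 2, 4]))

This doesn't help with the original problem, but it does illustrate the power, or complications, of using np.vectorize.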
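To show the tuple behaviour in isolation, here is a small sketch with a hypothetical function, min_and_max, that always returns exactly two values. Under that assumption np.vectorize treats the call as having two outputs and returns one array per output.

```python
import numpy as np


def min_and_max(seq):
    """Hypothetical example: always returns a 2-tuple."""
    return min(seq), max(seq)


# 1-D object array of lists, filled slot by slot.
darr = np.empty(3, dtype=object)
darr[0] = [16, 4, 30]
darr[1] = [5, 1, 2]
darr[2] = [1, 2, 4]

# The first result is a tuple of length 2, so nout becomes 2
# and the call returns a tuple of two arrays.
lows, highs = np.vectorize(min_and_max)(darr)

print(lows, highs)
```

This only works when every call returns the same number of items, which is why it cannot handle the variable-length lists produced by foo.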

Answer 2

You need to specify the output data type, otypes=[list/object/np.ndarray/etc], in np.vectorize:

In [2767]: def foo(array):
      ...:     return [x  for x in array if c[x] >= 3]

In [2768]: v = np.vectorize(foo, otypes=[list])

In [2769]: v(df.col_name)
Out[2769]: array([[4], [5, 1, 2], [4, 5], [1, 2, 4], [5, 4, 1, 2]], dtype=object)

In [2770]: df.assign(new_wack=v(df.col_name))
Out[2770]:
               col_name      new_wack
1           [16, 4, 30]           [4]
2             [5, 1, 2]     [5, 1, 2]
3       [4, 5, 52, 888]        [4, 5]
4             [1, 2, 4]     [1, 2, 4]
5  [5, 99, 4, 75, 1, 2]  [5, 4, 1, 2]

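From the docs:

If otypes is not specified, then a call to the function with the first argument will be used to determine the number of outputs. The results of this call will be cached if cache is True to prevent calling the function twice.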
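The extra call mentioned in the docs can be observed by counting invocations. The sketch below uses a made-up function, record, and a helper, count_calls, both purely illustrative; the counts in the comments follow from the documented behaviour quoted above.

```python
import numpy as np

calls = []


def record(x):
    """Made-up function that logs every invocation."""
    calls.append(x)
    return x * 2


def count_calls(vfunc, data):
    """Run vfunc on data and report how many times record was called."""
    calls.clear()
    vfunc(data)
    return len(calls)


data = np.array([1, 2, 3])

# No otypes: one probing call on the first element, then one call per element.
n_plain = count_calls(np.vectorize(record), data)               # 4

# cache=True reuses the probing call's result for the first element.
n_cached = count_calls(np.vectorize(record, cache=True), data)  # 3

# Explicit otypes: no probing call is needed.
n_typed = count_calls(np.vectorize(record, otypes=[int]), data)  # 3

print(n_plain, n_cached, n_typed)
```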

ValueError when using np.vectorize - where am I going wrong?
