Sklearn KernelDensity data type

Time: 2019-12-06 18:00:18

Tags: python-3.x scikit-learn rapids

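I need to specify the dtype (data type) for sklearn's KernelDensity function inside a def block of NVIDIA's RAPIDS cuDF library. In Python 3.7 I can find the type information, but for some reason the RAPIDS def block does not accept it as a valid data type. I have included the code and the error message below so that anyone can reproduce the error.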

Here is the code for a typical use of KernelDensity:

from sklearn.neighbors import KernelDensity
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(X)
kde.score_samples(X)
     array([-0.41075698, -0.41075698, -0.41076071, -0.41075698, -0.41075698,
    -0.41076071])

type(kde)
     <class 'sklearn.neighbors.kde.KernelDensity'>

Here is the NVIDIA RAPIDS def block that I am using with sklearn's KernelDensity:

import cudf, math
import numpy as np

df = cudf.DataFrame()
nelem = 10
df['in1'] = np.arange(nelem) * 1.5
df['in2'] = np.arange(nelem) * 1.45


#Define input columns for the kernel

in1 = df['in1']
in2 = df['in2']

def kernel(in1, in2, out1, out2, out3, out4, kwarg1, kwarg2):
    for i, (x, y) in enumerate(zip(in1, in2)):
        out1[i] = [math.tan(i) for i in x]
        out2[i] = np.array(out1[i].to_pandas())
        out3[i] = ((KernelDensity(kernel='gaussian', bandwidth=kwarg1).fit(out2[i])).score_samples(out2[i]))
        out4[i] = [i >= kwarg2 for i in out3[i]]

Results = cudf.DataFrame()
Results = df.apply_rows(kernel, incols=['in1','in2'], outcols=dict(out1='float', out2='float64', out3='float64', out4='float'), kwargs=dict(kwarg1=0.1, kwarg2=0.33))

Here is the error message (perhaps getting the correct dtype for x and out3 would resolve all of the errors):

 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/dataframe/dataframe.py", line 2707, in apply_rows
self, func, incols, outcols, kwargs, cache_key=cache_key
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 64, in apply_rows return applyrows.run(df)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 128, in run self.launch_kernel(df, bound.args, **launch_params)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/cudf/utils/applyutils.py", line 152, in launch_kernel self.kernel[blkct, blksz](*args)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 806, in __call__ kernel = self.specialize(*args)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 817, in specialize kernel = self.compile(argtypes)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 833, in compile **self.targetoptions)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 62, in compile_kernel
cres = compile_cuda(pyfunc, types.void, args, debug=debug, inline=inline)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock, return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/cuda/compiler.py", line 51, in compile_cuda, locals={})
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 972, in compile_extra, return pipeline.compile_extra(func)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 390, in compile_extra, return self._compile_bytecode()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 903, in _compile_bytecode, return self._compile_core()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 890, in _compile_core, res = pm.run(self.status)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock, return func(*args, **kwargs)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 266, in run
raise patched_exception
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 257, in run
stage()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 515, in stage_nopython_frontend self.locals)
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/compiler.py", line 1124, in type_inference_stage, infer.propagate()
  File "/anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py", line 927, in propagate, raise errors[0]
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f2679e6f9e8>) with argument(s) of type(s): (array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), array(float64, 1d, A), float64, float64) * parameterized

In definition 0:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'x': cannot determine Numba type of <class 'numba.ir.UndefinedType'>

File "<stdin>", line 2:
<source missing, REPL/exec in use?>

raised from /anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py:1254

In definition 1:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'x': cannot determine Numba type of <class 'numba.ir.UndefinedType'>

File "<stdin>", line 2:
<source missing, REPL/exec in use?>

raised from /anaconda3/envs/rapidsAI/lib/python3.7/site-packages/numba/typeinfer.py:1254
This error is usually caused by passing an argument of a type that is unsupported by the named function.
[1] During: resolving callee type: Function(<numba.cuda.compiler.DeviceFunctionTemplate object at 0x7f2679e6f9e8>)
[2] During: typing of call at <string> (11)


 File "<string>", line 11:
 <source missing, REPL/exec in use?>

1 answer:

Answer 0 (score: 1)

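Working code is below. Some of your lines are not compatible with cudf:

  1. Using i on its own, rather than as an index, does not work. It is always zero, so out1 is all zeros as well.
  2. Classes from sklearn are not compatible with numba's nopython mode. The same applies to any library that numba does not specifically support. I am not aware of any library with numba support that provides kernel density estimation. Numpy is supported, but it has no kernel density estimation.
  3. df.apply_rows() does not allow one function to be applied across multiple rows, which computing a kernel density requires. You may want to use df.apply_chunks() instead.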
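
Point 3 can be illustrated without a GPU. The following is a plain-Python analogy only: apply_per_row and apply_per_chunk are hypothetical helpers written for this sketch, not part of cudf. It shows that a per-row function sees a single value, whereas a per-chunk function sees all the rows that a density estimate would need.

```python
def apply_per_row(values, row_func):
    """Call row_func once per element; it never sees neighbouring rows."""
    return [row_func(v) for v in values]


def apply_per_chunk(values, chunk_size, chunk_func):
    """Call chunk_func once per consecutive chunk; it sees every row in the chunk."""
    out = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        out.extend(chunk_func(chunk))
    return out


def centre_on_chunk_mean(chunk):
    """A computation that needs the whole chunk: subtract the chunk mean."""
    mean = sum(chunk) / len(chunk)
    return [v - mean for v in chunk]


values = [0.0, 1.5, 3.0, 4.5, 6.0, 7.5]
row_result = apply_per_row(values, lambda v: v * 2.0)
chunk_result = apply_per_chunk(values, 3, centre_on_chunk_mean)
```

A statistic such as the chunk mean (or a kernel density) cannot be written as row_func, because row_func only ever receives one value at a time.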

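To implement kernel density estimation, you will need to:

  1. Use df.apply_chunks()
  2. Create a custom function that computes the kernel density. You can use part of this code to build the function: KernelDensity source code
  3. Make sure the custom function can apply the kernel to an np.array in order to compute the value for each window
  4. Set up the apply_chunks() call so that the chunks act as small rolling windows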
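
As a starting point for step 2, here is a minimal sketch of a Gaussian kernel density written with only explicit loops and the math module, on the assumption that this style is closer to what numba can compile than a sklearn class is. The function name gaussian_kde_log_density is made up for this sketch, it runs on the CPU as ordinary Python, and it has not been tested inside apply_chunks().

```python
import math


def gaussian_kde_log_density(points, bandwidth):
    """Log-density of a Gaussian KDE built from `points`, evaluated at each point.

    points: list of equal-length sequences of floats (one sequence per sample)
    bandwidth: kernel width h, a positive float
    """
    n = len(points)
    dim = len(points[0])
    # Log of the normalisation constant 1 / (n * (2*pi*h^2)^(dim/2))
    log_norm = -math.log(n) - 0.5 * dim * math.log(2.0 * math.pi * bandwidth ** 2)
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            # Squared Euclidean distance between p and q
            sq_dist = 0.0
            for a, b in zip(p, q):
                sq_dist += (a - b) ** 2
            total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
        # total >= 1 because each point contributes exp(0) for itself
        scores.append(math.log(total) + log_norm)
    return scores


# Same data and bandwidth as the sklearn example in the question
X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
scores = gaussian_kde_log_density(X, 0.2)
```

For this data the scores agree with the kde.score_samples(X) output shown in the question to about six decimal places. Every score is a plain float, so float64 output columns would be the matching dtype.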

Code:

import cudf, math
import numpy as np

df = cudf.DataFrame()
nelem = 10
df['in1'] = np.arange(nelem) * 1.5
df['in2'] = np.arange(nelem) * 1.45


#Define input columns for the kernel

in1 = df['in1']
in2 = df['in2']

def kernel(in1, in2, out1, out2, out3, out4, kwarg1, kwarg2):
    for i, (x, y) in enumerate(zip(in1, in2)):
        # Each output element must be assigned a single scalar, not a list or array
        out1[i] = math.tan(float(i))
        out2[i] = out1[i]
        # Placeholder value: sklearn's KernelDensity cannot be compiled by numba
        out3[i] = 1 #((KernelDensity(kernel='gaussian', bandwidth=kwarg1).fit(out2[i])).score_samples(out2[i]))
        out4[i] = out3[i] >= kwarg2

# Output dtypes are given as numpy types rather than the strings 'float' / 'float64'
Results = df.apply_rows(kernel,
                        incols=['in1', 'in2'],
                        outcols=dict(out1=np.float64, out2=np.float64, out3=np.float64, out4=np.float64),
                        kwargs=dict(kwarg1=0.1, kwarg2=0.33))