DASK: TypeError: Column assignment doesn't support type numpy.ndarray, while pandas works fine

Time: 2019-10-06 04:33:58

Tags: python pandas numpy dask

I am using Dask to read a 10-million-row CSV and perform some calculations. So far it has proven to be about 10x faster than pandas.

Below is a snippet that works fine with pandas, but Dask raises a TypeError. I'm not sure how to get past it. It seems that with Dask the array produced by the select function has to be moved back into the dataframe/column, but with pandas it doesn't? I don't want to convert the whole thing back to pandas and lose the 10x performance benefit.

This question is the result of some other help I received on Stack Overflow, but I think it has drifted far enough from the original that it is a different problem entirely. Code below.

PANDAS: works. Time taken, excluding AndHeathSolRadFact: 40 seconds

import pandas as pd
import numpy as np

from timeit import default_timer as timer
start = timer()
df = pd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
df['DateTime'] = pd.to_datetime(df['Date'], format='%Y-%d-%m %H:%M')
df['Month'] = df['DateTime'].dt.month
df['Grass_FMC'] = (97.7+4.06*df['RH'])/(df['Temperature']+6)-0.00854*df['RH']+3000/df['Curing']-30


df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    ((df['Month'].between(1,2)) & (df['CloudCover']>30))  # parenthesise both sides: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match



print(df.head())
#print(ddf.tail())
end = timer()
print(end - start)

DASK: broken. Time taken, excluding AndHeathSolRadFact: 4 seconds

import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np




from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M')
ddf['Month'] = ddf['DateTime'].dt.month
ddf['Grass_FMC'] = (97.7+4.06*ddf['RH'])/(ddf['Temperature']+6)-0.00854*ddf['RH']+3000/ddf['Curing']-30



ddf["AndHeathSolRadFact"] = np.select(
    [
    (ddf['Month'].between(8,12)),
    ((ddf['Month'].between(1,2)) & (ddf['CloudCover']>30))  # parenthesise both sides: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match



print(ddf.head())
#print(ddf.tail())
end = timer()
print(end - start)


Error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-86c08f38bce6> in <module>
     29     ],  #list of conditions
     30     [1, 1],     #list of results
---> 31     default=0)    #default if no match
     32 
     33 

~\Anaconda3\lib\site-packages\dask\dataframe\core.py in __setitem__(self, key, value)
   3276             df = self.assign(**{k: value for k in key})
   3277         else:
-> 3278             df = self.assign(**{key: value})
   3279 
   3280         self.dask = df.dask

~\Anaconda3\lib\site-packages\dask\dataframe\core.py in assign(self, **kwargs)
   3510                 raise TypeError(
   3511                     "Column assignment doesn't support type "
-> 3512                     "{0}".format(typename(type(v)))
   3513                 )
   3514             if callable(v):

TypeError: Column assignment doesn't support type numpy.ndarray

Weathergrids CSV sample

Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0 
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
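Note the unusual year-day-month layout of the Date column, which is what the `%Y-%d-%m %H:%M` format string in the code above matches. A quick sanity check with pandas alone:

```python
import pandas as pd

# '2019-20-09 04:00' is year-day-month, so this is 20 September 2019
ts = pd.to_datetime("2019-20-09 04:00", format="%Y-%d-%m %H:%M")
print(ts)  # 2019-09-20 04:00:00
```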

6 Answers:

Answer 0 (score: 2)

I ran into a similar problem and got it working by converting the ndarray to a Dask array. I also had to make sure the number of partitions matched between the ndarray and the Dask DataFrame.

Answer 1 (score: 1)

Assigning a pandas Series to a Dask DataFrame column works.

dask_df['col'] = pd.Series(list or array)

Answer 2 (score: 1)

For some reason that still isn't entirely clear to me, the solutions above did not work for me.

I ended up defining a function that performs the column assignment on each pandas partition, then mapping that function across all of my Dask partitions.

def map_randoms(df):
    # df is a single pandas partition here
    df['col_rand'] = np.random.randint(0, 2, size=len(df))
    return df

ddf = ddf.map_partitions(map_randoms)
ddf = ddf.persist()  # persist returns a new collection, so keep the reference

Answer 3 (score: 0)

This answer is not pretty, but it works.

I found that on an 11-million-row dataset the select function was about 20 seconds faster in pandas. I also found that even when I performed the same function in dask, the result came back as a NumPy array. Dask can't accept that natively, but it is possible to move dataframes back and forth between Dask and pandas.

So I get the benefit of doing the load and the date conversion in dask (4 seconds versus 40 seconds in pandas), the benefit of doing the select in pandas (about 40 seconds, versus 60 in dask), and I just have to accept that I will use more memory.

Very little time is lost converting between the two dataframes.

Finally, I had to make sure I cleaned up the dataframes, because Python wasn't releasing the memory between test runs and it just kept accumulating.

import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np


from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
#print(ddf.describe(include='all'))

#Wrangle the dates so we can interrogate them
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M')
ddf['Month'] = ddf['DateTime'].dt.month

#Grass Fuel Moisture Content
ddf['Grass_FMC'] = (97.7+4.06*ddf['RH'])/(ddf['Temperature']+6)-0.00854*ddf['RH']+3000/ddf['Curing']-30

#Convert to a Pandas DataFrame because dask was being slow with the select logic below
df = ddf.compute() 
del ddf

#ddf["AndHeathSolRadFact"] = np.select(
#Solar Radiation Factor - this seems to take 32 seconds. Why?
df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    ((df['Month'].between(1,2)) & (df['CloudCover']>30))  # parenthesise both sides: & binds tighter than >
    ],  #list of conditions
    [1, 1],     #list of results
    default=0)    #default if no match

#Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df,npartitions=4)
del df

print(ddf2.head())
#print(ddf.tail())
end = timer()
print(end - start)

#Clean up remaining dataframes
del ddf2

Answer 4 (score: 0)

Could you try adding .any() or .all() to the end of your np.select() statement?

df["AndHeathSolRadFact"] = np.select(
    [
    (df['Month'].between(8,12)),
    ((df['Month'].between(1,2)) & (df['CloudCover']>30))
    ],  #list of conditions
    [1, 1],     #list of results
    default=0).all()    #default if no match

Answer 5 (score: 0)

Edit: I do have a good solution for your problem:

from dask.array import from_array as fa
# assign to the Dask frame itself; an assignment into the temporary
# pandas frame returned by df.compute() would just be discarded
df['Name of your column'] = fa(the_list_you_want_to_assign_as_column)