How to write a Dask dataframe containing a column of arrays to a Parquet file

Date: 2018-02-14 19:05:58

Tags: python dask fastparquet

I have a Dask dataframe, one column of which contains numpy arrays of floats. If I try to write it out as a Parquet file, I get the error:

df.to_parquet('somefile')
....
Error converting column "vec" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
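For context, here is a minimal sketch of a dataframe with this kind of column; the construction and the "id" column are my own assumptions, since the original setup code is elided:

import numpy as np
import pandas as pd
import dask.dataframe as dd

# Hypothetical setup: each cell of "vec" holds a numpy array of floats,
# so pandas/dask store the column with dtype "object".
pdf = pd.DataFrame({
    "id": [1, 2, 3],
    "vec": [np.random.rand(4) for _ in range(3)],
})
df = dd.from_pandas(pdf, npartitions=1)
print(df.dtypes)   # "vec" shows up as object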

I presume this is because the 'vec' column has dtype 'object', so the Parquet serializer tries to write it as a string. Is there a way to tell the Dask DataFrame, or the serializer, that the column is an array of floats?

1 answer:

Answer 0 (score: 4)

I found that this is possible if you use the pyarrow engine instead of the default fastparquet:

pip install pyarrow
# or: conda install pyarrow

Then:

df.to_parquet('somefile', engine='pyarrow')

The fastparquet documentation at https://github.com/dask/fastparquet/ says "only simple data-types and plain encoding are supported", so I guess that means no arrays.
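For what it's worth, a quick round trip to check that the array column survives might look like the sketch below; the file path is just the one from the question, and the values typically come back as list-like objects rather than numpy arrays:

import dask.dataframe as dd

# Write and read back with the pyarrow engine (sketch; "df" and the path reused from above).
df.to_parquet('somefile', engine='pyarrow')
df2 = dd.read_parquet('somefile', engine='pyarrow')

# The "vec" values usually come back as list-like objects rather than numpy arrays.
print(df2.compute().head())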