I have a Dask DataFrame in which one column contains a numpy float array. If I try to write it out as Parquet:

df.to_parquet('somefile')

I get the error:

Error converting column "vec" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
I think this is because the 'vec' column has dtype 'object', so the Parquet serializer tries to write it as a string. Is there a way to tell the Dask DataFrame, or the serializer, that the column holds arrays of floats?
Answer 0 (score: 4)

I found that this is possible if you use the pyarrow engine in place of the default fastparquet:

pip/conda install pyarrow

then:

df.to_parquet('somefile', engine='pyarrow')

The fastparquet documentation at https://github.com/dask/fastparquet/ says "only simple data-types and plain encoding are supported", so I guess that means no arrays.