I converted an example DataFrame into a pyarrow .arrow file:
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [10, 2, 3]})
df['a'] = pd.to_numeric(df['a'], errors='coerce')

# Convert the DataFrame to an Arrow table and write it as an IPC file
table = pa.Table.from_pandas(df)
writer = pa.RecordBatchFileWriter('test.arrow', table.schema)
writer.write_table(table)
writer.close()
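To rule out a problem on the writing side, the file can be read back in Python first. A quick sanity check, assuming pyarrow's RecordBatchFileReader (the reading counterpart of the writer above):

import pyarrow as pa

# Open the file with the random-access file reader and rebuild the table
reader = pa.RecordBatchFileReader('test.arrow')
table = reader.read_all()   # concatenates all record batches into one Table
print(table.num_rows)       # expected: 3
print(table.to_pandas())    # round-trips back to the original DataFrame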
This creates the file test.arrow. For reference, df.info() reports:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 1 columns):
a 3 non-null int64
dtypes: int64(1)
memory usage: 104.0 bytes
Then in NodeJS, I load the file with ArrowJS (https://arrow.apache.org/docs/js/):
const fs = require('fs');
const arrow = require('apache-arrow');

const data = fs.readFileSync('test.arrow');
const table = arrow.Table.from(data);

console.log(table.schema.fields.map(f => f.name)); // column names
console.log(table.count());                        // number of rows
console.log(table.get(0));                         // first row
The output looks like this:
[ 'a' ]
0
null
I expect the table to have a length of 3, and table.get(0) to return the first row rather than null.
Here is a dump of the table's schema, which is what console.log(table._schema) looks like:
[ Int_ [Int] { isSigned: true, bitWidth: 16 } ]
Schema {
fields:
[ Field { name: 'a', type: [Int_], nullable: true, metadata: Map {} } ],
metadata:
Map {
'pandas' => '{"index_columns": [{"kind": "range", "name": null, "start": 0, "stop": 5, "step": 1}], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "field_name": "a", "pandas_type": "int16", "numpy_type": "int16", "metadata": null}], "creator": {"library": "pyarrow", "version": "0.15.0"}, "pandas_version": "0.22.0"}' },
dictionaries: Map {} }
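One clue is buried in that metadata: the pandas blob that pyarrow embeds in the schema records which library version wrote the file. A minimal sketch to pull it out, assuming the same ArrowJS API as above (schema.metadata is a plain JS Map here):

// Parse the pandas metadata that pyarrow stores in the schema
const pandasMeta = JSON.parse(table.schema.metadata.get('pandas'));
console.log(pandasMeta.creator); // { library: 'pyarrow', version: '0.15.0' }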
Does anyone know why it isn't getting the expected data?
Answer (score: 2)
This is due to a format change in Arrow 0.15, as mentioned by Wes on the Apache JIRA. It means that all Arrow libraries, not just PyArrow, hit this problem when an IPC file is sent to an older version of Arrow. The fix is to upgrade ArrowJS to 0.15.0 so that you can round-trip between the JS library and the other Arrow libraries. If you can't upgrade for some reason, use one of the workarounds below instead (a round-trip check follows after them):
Pass use_legacy_format=True to RecordBatchFileWriter:
with pa.RecordBatchFileWriter('file.arrow', table.schema, use_legacy_format=True) as writer:
writer.write_table(table)
Or set the environment variable ARROW_PRE_0_15_IPC_FORMAT to 1:
$ export ARROW_PRE_0_15_IPC_FORMAT=1
$ python
>>> import pyarrow as pa
>>> table = pa.Table.from_pydict( {"a": [1, 2, 3], "b": [4, 5, 6]} )
>>> with pa.RecordBatchFileWriter('file.arrow', table.schema) as writer:
... writer.write_table(table)
...
Or downgrade PyArrow to 0.14.x:
$ conda install -c conda-forge pyarrow=0.14.1
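With any of these workarounds in place, the file should load correctly even in an older ArrowJS. As a quick round-trip check, reusing the reader code from the question (this assumes the two-column table from the environment-variable example was written to file.arrow):

const fs = require('fs');
const arrow = require('apache-arrow');

const table = arrow.Table.from(fs.readFileSync('file.arrow'));
console.log(table.schema.fields.map(f => f.name)); // [ 'a', 'b' ]
console.log(table.count());                        // 3 instead of 0
console.log(table.get(0));                         // the first row instead of null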