I decided to get familiar with the Arrow package. I thought it would be a good idea
to run some of the usage examples (https://github.com/apache/arrow/tree/master/python/examples/minimal_build).
docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
docker run --rm -t -i -v $PWD:/io arrow_ubuntu_minimal /io/build_venv.sh
Unfortunately, after running the last command, the console shows:
E ModuleNotFoundError: No module named 'pyarrow._dataset'
pyarrow/dataset.py:23: ModuleNotFoundError
====================================================================================== warnings summary ======================================================================================
pyarrow/tests/test_serialization.py:283
/root/arrow/python/pyarrow/tests/test_serialization.py:283: PytestDeprecationWarning: @pytest.yield_fixture is deprecated.
Use @pytest.fixture instead; they are the same.
@pytest.yield_fixture(scope='session')
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_infer_lists
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_to_list_of_structs_pandas
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
/root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
if np.any(np.asarray(left_value != right_value)):
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
/root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
if np.any(np.asarray(left_value != right_value)):
-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================================================================== short test summary info ===================================================================================
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_filesystem - ModuleNotFoundError: No module named 'pyarrow._dataset'
============================================================ 1 failed, 3168 passed, 689 skipped, 16 xfailed, 5 warnings in 48.01s ============================================================
marcin@marcin-G3-3579:
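Has anyone run into a similar problem, or does anyone know how to fix it?
I am currently on Ubuntu 20.04. That might be the cause, since the example is set up for Ubuntu 18.04, but I don't see a way to check that.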
Answer 0 (score: 2)
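This looks like a bug in the minimal example. Feel free to file a JIRA.
The Arrow C++ package has a number of feature flags that can be turned on (to enable features) or off (to speed up build times and reduce dependencies). Python tests that rely on a particular feature should check whether the flag is present and skip if it is not. This test does not do that.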
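The check-and-skip idea can be sketched with only the standard library. This is an illustration of the pattern, not how pyarrow's own suite is written (it uses pytest markers); `module_available`, `HAS_DATASET`, and the test class are hypothetical names, and `pyarrow.dataset` is taken from the traceback in the question.

```python
import importlib
import unittest


def module_available(name: str) -> bool:
    """Return True if `name` can be imported, False otherwise."""
    try:
        importlib.import_module(name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return False
    return True


# Evaluated once at import time. False when pyarrow is missing entirely
# or when it was built without the dataset component.
HAS_DATASET = module_available("pyarrow.dataset")


@unittest.skipUnless(HAS_DATASET, "pyarrow was built without dataset support")
class DatasetDependentTest(unittest.TestCase):
    def test_dataset_module_imports(self):
        # Only runs when the guard above found the module.
        import pyarrow.dataset as ds
        self.assertTrue(hasattr(ds, "dataset"))
```

With a guard like this, a build that lacks the feature reports the test as skipped rather than failed.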
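In the meantime, you can ignore the test failure, change the test so that it is skipped (I believe by adding @pytest.mark.dataset above the test name), or add datasets to your C++ build (probably my preferred option).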
To add datasets to your C++ build, you can add -DARROW_DATASET=ON next to -DARROW_PARQUET=ON (in build_venv.sh).
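Before rebuilding the image, it can help to confirm that the flag actually landed in the script. Below is a small sketch that scans a script's text for -DNAME=VALUE options. The helper names and the `sample` excerpt are hypothetical; the real build_venv.sh passes more options than shown here.

```python
import re

# Matches CMake cache definitions such as -DARROW_DATASET=ON
_CMAKE_DEFINE = re.compile(r"-D([A-Za-z0-9_]+)=(\S+)")


def cmake_defines(script_text: str) -> dict:
    """Collect -DNAME=VALUE pairs found anywhere in a shell script's text."""
    return {name: value for name, value in _CMAKE_DEFINE.findall(script_text)}


def flags_not_on(script_text: str, required) -> list:
    """Return the names in `required` that are absent or not set to ON."""
    defines = cmake_defines(script_text)
    return [name for name in required if defines.get(name, "").upper() != "ON"]


# Hypothetical excerpt, NOT the actual contents of build_venv.sh.
sample = """
cmake -GNinja \\
      -DARROW_PARQUET=ON \\
      -DARROW_PYTHON=ON \\
      ..
"""
print(flags_not_on(sample, ["ARROW_PARQUET", "ARROW_DATASET"]))
# prints ['ARROW_DATASET']
```

To run it against the real file, pass open("build_venv.sh").read() instead of `sample`. The scan is purely textual, so a flag inside a commented-out line would still be counted.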