How can I implement the Dremel paper example in pyarrow?

Time: 2018-10-15 06:49:44

Tags: python parquet pyarrow apache-arrow

I am trying to use the following example document schema from the Dremel paper, "Dremel: Interactive Analysis of Web-Scale Datasets":

message Document {
  required int64 DocId;
  optional group Links {
    repeated int64 Backward;
    repeated int64 Forward; 
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country; 
    }
    optional string Url; 
  }
}
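For orientation, the Dremel paper stores each leaf field of this schema as a column of values, each tagged with a repetition level and a definition level. The following is a minimal plain-Python sketch for the single column Name.Language.Code. The helper name `stripe_code_column` and the lowercase dictionary keys are assumptions made for illustration, and the two records are trimmed copies of the sample records used further down in this question.

```python
# Trimmed sample records: only the Name group is kept (illustrative copies).
r1 = {"name": [
    {"language": [{"code": "en_us", "country": "us"}, {"code": "en"}],
     "url": "http://A"},
    {"url": "http://B"},
    {"language": [{"code": "en-gb", "country": "gb"}]},
]}
r2 = {"name": [{"url": "http://C"}]}


def stripe_code_column(record):
    """Return (value, repetition_level, definition_level) triples
    for the Name.Language.Code column of one record (hypothetical helper)."""
    names = record.get("name", [])
    if not names:
        # No Name at all: nothing on the path is defined.
        return [(None, 0, 0)]
    rows = []
    for i, name in enumerate(names):
        # The first Name starts a new record (0); later ones repeat at Name (1).
        name_rep = 0 if i == 0 else 1
        languages = name.get("language", [])
        if not languages:
            # Name is defined but Language is not: definition level 1.
            rows.append((None, name_rep, 1))
            continue
        for j, language in enumerate(languages):
            # Later Language entries within the same Name repeat at Language (2).
            rep = name_rep if j == 0 else 2
            # Code is required, so only Name and Language count: level 2.
            rows.append((language["code"], rep, 2))
    return rows


column = stripe_code_column(r1) + stripe_code_column(r2)
for row in column:
    print(row)
```

Code is a required field, so it adds nothing to the definition level; only the repeated groups Name and Language do, which is why the maximum level in this column is 2.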

I want to save documents with this schema in the Apache Parquet file format using pyarrow. The implementation I am trying is as follows:

import pyarrow as pa
import pyarrow.parquet as pq

links_type = pa.struct([
    pa.field("backward", pa.list_(pa.int64())),
    pa.field("forward", pa.list_(pa.int64())),
])

language_type = pa.struct([
    pa.field("code", pa.string(), nullable=False),
    pa.field("country", pa.string())
])

names_type = pa.struct([
    pa.field("language", pa.list_(language_type)),
    pa.field("url", pa.string()),
])

document_type = pa.struct([
    pa.field("doc_id", pa.int64(), nullable=False),
    pa.field("links", links_type, nullable=True),
    pa.field("name", pa.list_(names_type))
])

r1 = {
    "doc_id": 10,
    "links": {
        "forward": [20, 40, 60]
    },
    "name": [
        {
            "language": [
                {
                    "code": "en_us",
                    "country": "us"
                },
                {
                    "code": "en"
                }
            ],
            "url": "http://A"
        },
        {
            "url": "http://B"
        },
        {
            "language": [
                {
                    "code": "en-gb",
                    "country": "gb"
                }
            ]
        }
    ]
}

r2 = {
    "doc_id": 20,
    "links": {
        "forward": [80],
        "backward": [10, 30],
    },
    "name": [
        {
            "url": "http://C"
        }
    ]
}

records = pa.array([r1,r2], document_type)
batch = pa.RecordBatch.from_arrays([records], names=["documents"])
table = pa.Table.from_batches([batch])

pq.write_table(table, "dremel_pyarrow.parquet")

However, this program fails at pq.write_table(table, "dremel_pyarrow.parquet") with an exception, followed by a segmentation fault:

    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 934, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children
Segmentation fault (core dumped)

So I am curious: is it possible to reproduce the Dremel paper example in pyarrow?

0 answers:

No answers yet