Question

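I've recently been following a tutorial on building an Onionscan scraper with a Python wrapper. Since that tutorial was written, Onionscan has moved from storing all of its data in JSON files to storing some of it in a database built with Tiedot.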

I'm trying to find a way to take one of these files, which has no extension and is simply named dat_0, and parse it using Python.

Looking at dat_0 in macOS TextEdit, I get the following...

While Sublime Text displays it as...

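I've been trying to work out how to parse this file with Python. Judging from Tiedot's documentation and from what TextEdit shows, I assume it uses a JSON structure, but I haven't had much luck.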

dat_0

or

import json
f = open('crawls/dat_0','rb')
data = json.dumps(f.read())

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Does anyone... a) know what this file is? b) know how to parse it successfully so that the data is usable in Python?

Answer 1

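For anyone who finds this in the future, I found a solution myself. I used Kaitai Struct to create a binary parser for the Tiedot file structure. Kaitai can generate parsers for many languages, so it is a very useful tool.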

The Kaitai Struct definition I used to generate the parser is:

meta:
  id: parser
seq:
  - id: records
    type: record
    repeat: eos
types:
  record:
    seq:
    - id: validity
      type: s1
    - id: allocated
      type: s8le
    - id: document
      type: str
      encoding: utf-8
      terminator: 1
      eos-error: false
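
For readers who want to see what that definition describes without installing Kaitai, here is a minimal hand-written sketch of the same record layout in plain Python. It is not the parser Kaitai generates, only an approximation of it: the names `HEADER`, `TERMINATOR` and `iter_records` are made up for illustration, and decoding with `errors="replace"` and silently stopping on a truncated trailing header are my own assumptions rather than Kaitai's behaviour.

```python
import struct

# One signed byte (validity) followed by a signed 64-bit little-endian
# integer (allocated): mirrors `s1` and `s8le` in the definition above.
HEADER = struct.Struct("<bq")
# Mirrors `terminator: 1`: the document string ends at the first 0x01 byte.
TERMINATOR = b"\x01"


def iter_records(buf):
    """Yield (validity, allocated, document) tuples from raw bytes."""
    pos = 0
    # Assumption: stop quietly if fewer than 9 header bytes remain.
    while pos + HEADER.size <= len(buf):
        validity, allocated = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        end = buf.find(TERMINATOR, pos)
        if end == -1:
            # Mirrors `eos-error: false`: read to the end of the data.
            end = len(buf)
        # Assumption: replace undecodable bytes instead of raising.
        document = buf[pos:end].decode("utf-8", errors="replace")
        pos = end + 1  # skip past the terminator byte
        yield validity, allocated, document


# Synthetic bytes for illustration only, not real Onionscan data.
sample = (
    HEADER.pack(1, 64) + b'{"a": 1}' + TERMINATOR
    + HEADER.pack(0, 32) + b'{"b": 2}'
)
records = list(iter_records(sample))
```

For a real file, the bytes could be read with `open('crawls/dat_0', 'rb').read()` and passed to `iter_records`. Whether each `document` string can then go straight into `json.loads` depends on how Tiedot pads its records, which I have not verified here.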

Using Python to read raw (binary?) Tiedot (NoSQL/JSON) data from Onionscan
