'Search for pattern exhausted' when processing a WARC file in Python 3

Asked: 2016-02-23 14:31:35

Tags: python python-3.x warc

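I'm trying to extract some plain text from a WARC dataset (Yahoo! Webscope L2), and I keep running into ValueError: Search for pattern exhausted when calling the load() function of the Python 3 module warcat. I tried a few random sample WARC files and they all load fine.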

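The dataset does require submitting a further license agreement (after which, according to the readme, a password is provided; can WARC files even be password-protected?), but I have no way to send a fax right now.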

I also looked at the warcat source code and found that the ValueError is raised when file_obj.read(size) evaluates to false. That doesn't make sense to me, so I'm asking here...
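For reference, here is how I understand that check, as a minimal sketch. This is not warcat's actual code: `find_pattern` and `chunk_size` are hypothetical names used only for illustration. In Python, `read()` on a binary file returns an empty `bytes` object at end of file, and an empty `bytes` object is falsy, so a check like this would fire when the file ends before the pattern is found.

```python
import io


def find_pattern(file_obj, pattern, chunk_size=4096):
    """Hypothetical helper (not warcat's API): return the offset just past
    the first occurrence of `pattern`, reading the stream in chunks."""
    start = file_obj.tell()
    buffer = b''
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            # read() returned b'' (end of file), which is falsy:
            # the stream ended before the pattern ever appeared.
            raise ValueError('Search for pattern exhausted')
        buffer += data
        index = buffer.find(pattern)
        if index != -1:
            return start + index + len(pattern)


# A WARC-like header ends with a blank line, so the pattern is found.
good = io.BytesIO(b'WARC/1.0\r\n\r\nbody')
print(find_pattern(good, b'\r\n\r\n'))

# A stream that never contains the pattern hits end of file instead.
bad = io.BytesIO(b'not a warc file')
try:
    find_pattern(bad, b'\r\n\r\n')
except ValueError as error:
    print(error)
```

If that reading is right, the error would mean the expected header delimiter never shows up anywhere in my file.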

Code:

>>> import warcat
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('ydata-embedded-metadata-v1_0.warc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
    self.read_file_object(f)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
    record, has_more = self.read_record(file_object)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File "/usr/local/lib/python3.4/site-packages/warcat/model/record.py", line 59, in load
    inclusive=True)
  File "/usr/local/lib/python3.4/site-packages/warcat/util.py", line 66, in find_file_pattern
    raise ValueError('Search for pattern exhausted')
ValueError: Search for pattern exhausted
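Since the sample files load fine, one quick sanity check is whether this file looks like a WARC at all. The sketch below only inspects the first few bytes. `sniff_container` is a hypothetical helper name, and it relies on two assumptions: an uncompressed WARC file begins with a `WARC/` version line, and a gzip stream begins with the bytes 0x1f 0x8b. Anything else (for example an encrypted archive) would be reported as unknown.

```python
import io


def sniff_container(file_obj):
    """Hypothetical helper: guess the container type from the leading bytes."""
    head = file_obj.read(5)
    if head.startswith(b'\x1f\x8b'):
        return 'gzip'      # gzip magic number
    if head == b'WARC/':
        return 'warc'      # start of a WARC version line, e.g. WARC/1.0
    return 'unknown'       # neither plain WARC nor gzip


# Self-contained demonstration on in-memory data.
print(sniff_container(io.BytesIO(b'WARC/1.0\r\n')))
print(sniff_container(io.BytesIO(b'\x1f\x8b\x08\x00')))
print(sniff_container(io.BytesIO(b'PK\x03\x04')))
```

For the real file this would be called as `sniff_container(open('ydata-embedded-metadata-v1_0.warc', 'rb'))`.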

Thanks in advance.

0 answers:

No answers yet.