Question

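I have been searching the web but can't really find what I need. I have a web.warc.gz file that contains data, and I need to extract the WARC headers from it. I have installed Tomcat and Wayback (1.6) and tried to use the ./warc-header script that ships with Wayback to extract them, but I keep getting an error message about the format I am using: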

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
~/Desktop/output.csv type \r\n
      USAGE: tgtWarc fieldsSrc id
        tgtWarc is the path to the target WARC.gz
          fieldsSrc is the path to the text of the record
    make sure each line is terminated by \r\n
    and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
    Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
    of the header record... header...

Or a different kind of error:

   Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
    ~/Desktop/output.csv Content-Type
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:

at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)

I'm fairly sure the problem is the format of what I type on the command line, but I still can't get it right. Can anyone help?

Answer 1

You can get the headers using the code from the following GitHub project:

https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
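Judging only from the usage text and the `writeHeaderRecord` frame in the stack trace above, `warc-header` appears to write a header record into a target WARC rather than extract one, which would explain why neither invocation works. If a Java project is more than you need, here is a minimal standard-library Python sketch of the same idea. It assumes each record starts with a `WARC/` version line, carries a `Content-Length` field, and has no folded (multi-line) header values. The names `iter_warc_headers` and `read_warc_headers` and the `WARC-Version` dictionary key are made up for this example and are not part of Wayback or the linked project.

```python
import gzip


def iter_warc_headers(stream):
    """Yield one dict of WARC header fields per record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        # "WARC-Version" is a key invented here to hold the version line.
        headers = {"WARC-Version": line.strip().decode("utf-8", "replace")}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body so the next readline() lands on the next record.
        remaining = int(headers.get("Content-Length", 0))
        while remaining > 0:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                break  # truncated file
            remaining -= len(chunk)
        yield headers


def read_warc_headers(path):
    """Return the header dicts of every record in a .warc.gz file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        return list(iter_warc_headers(stream))
```

Calling `read_warc_headers("WEB.WARC.gz")` returns a list of dictionaries, one per record, which you can then write to a CSV with the `csv` module if you want a file like `output.csv`.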
