Question

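I have a dataset of documents that contain HTTP headers. I want to go through these documents and remove the headers while keeping the rest of the text. How can I do this?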

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:58:44Z
WARC-TREC-ID: clueweb12-0000wb-76-38422
WARC-IP-Address: 207.241.148.80
WARC-Payload-Digest: sha1:W6JMWCNM43FDYNW466OADMH2KDGKJCGR
WARC-Target-URI: http://someurl.http
WARC-Record-ID: <urn:uuid:5a783f09-f0d8-4564-8f3a-c0d1ace7177b>
Content-Type: application/http; msgtype=response
Content-Length: 26043

HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:58:45 GMT
Server: Apache
Vary: *
PRAGMA: no-cache
P3P: CP="IDC DSP COR DEVa TAIa OUR BUS UNI"
Cache-Control: max-age=-3600
Expires: Fri, 10 Feb 2012 20:58:45 GMT
Connection: close
Content-Type: text/html

Answer 1

This will do what you want. It leaves the original file untouched and writes the cleaned version to a new file.

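datafile = 'test1.txt'
outputfile = 'output.txt'

with open(outputfile, encoding='utf-8', mode='w') as outfile:
    with open(datafile, encoding='utf-8', mode='r') as infile:
        # True while we are inside a header block that should be skipped
        foundhdrstart = False

        for line in infile:
            # 'WARC/1.0' marks the first line of a header block
            if line.strip() == 'WARC/1.0':
                foundhdrstart = True
            # Copy only lines that lie outside a header block
            if not foundhdrstart:
                outfile.write(line)
            # 'Content-Type: text/html' is the last header line in the sample;
            # everything after it is treated as document text again
            if line.strip() == 'Content-Type: text/html':
                foundhdrstart = False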
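
Note that the loop above only leaves the header block when it sees a line that is exactly 'Content-Type: text/html'. If some records end with a different last header (for example 'Content-Type: text/html; charset=utf-8'), the rest of the file would be skipped. A possible variation, sketched below, relies on the blank line that ends each header block instead. It assumes every record has the same layout as the sample in the question (a WARC block, a blank line, an optional HTTP block starting with 'HTTP/', a blank line, then the text). The names strip_warc_headers and clean_file are made up for this example.

```python
def strip_warc_headers(lines):
    """Yield only the lines that are outside WARC and HTTP header blocks.

    Assumes each header block is terminated by a blank line, as in the
    sample from the question.
    """
    state = "body"
    for line in lines:
        text = line.strip()

        if state == "expect_http":
            if text.startswith("HTTP/"):
                # Status line: an HTTP header block follows the WARC block
                state = "http_headers"
                continue
            # No HTTP block here, so handle this line as ordinary body text
            state = "body"

        if state == "body":
            if text == "WARC/1.0":
                state = "warc_headers"  # start of a new record header
            else:
                yield line
        elif state == "warc_headers":
            if text == "":
                state = "expect_http"   # blank line ends the WARC block
        elif state == "http_headers":
            if text == "":
                state = "body"          # blank line ends the HTTP block


def clean_file(datafile, outputfile):
    """Write a copy of datafile with the header blocks removed."""
    with open(datafile, encoding="utf-8") as infile, \
            open(outputfile, encoding="utf-8", mode="w") as outfile:
        outfile.writelines(strip_warc_headers(infile))
```

Calling clean_file('test1.txt', 'output.txt') would then play the same role as the script above. Unlike that script, this version also drops the blank line that ends the HTTP header block.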
