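I have a dataset of documents that contain HTTP headers. I want to go through these documents and remove the headers while keeping the rest of the text. How can I do this?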
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:58:44Z
WARC-TREC-ID: clueweb12-0000wb-76-38422
WARC-IP-Address: 207.241.148.80
WARC-Payload-Digest: sha1:W6JMWCNM43FDYNW466OADMH2KDGKJCGR
WARC-Target-URI: http://someurl.http
WARC-Record-ID: <urn:uuid:5a783f09-f0d8-4564-8f3a-c0d1ace7177b>
Content-Type: application/http; msgtype=response
Content-Length: 26043
HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:58:45 GMT
Server: Apache
Vary: *
PRAGMA: no-cache
P3P: CP="IDC DSP COR DEVa TAIa OUR BUS UNI"
Cache-Control: max-age=-3600
Expires: Fri, 10 Feb 2012 20:58:45 GMT
Connection: close
Content-Type: text/html
Answer 0 (score: 1)
This will do what you want. It leaves the original file untouched and writes the cleaned version to a new file.
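datafile = 'test1.txt'
outputfile = 'output.txt'

with open(outputfile, encoding='utf-8', mode='w') as outfile:
    with open(datafile, encoding='utf-8', mode='r') as infile:
        foundhdrstart = False  # True while we are inside a header block
        for line in infile:
            # A record in the sample starts with this line; begin skipping here.
            if line.strip() == 'WARC/1.0':
                foundhdrstart = True
            # Copy only the lines that fall outside a header block.
            if not foundhdrstart:
                outfile.write(line)
            # Last header line in the sample; resume copying after it.
            if line.strip() == 'Content-Type: text/html':
                foundhdrstart = False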
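
If it helps, the same skip-between-markers idea can be pulled into a small generator so it can be tried on a list of lines without touching any files. This is only a sketch: `strip_headers`, `start_marker` and `end_prefix` are names made up for illustration, and matching the end of the header by prefix (so that a line such as `Content-Type: text/html; charset=UTF-8` also counts) is an assumption about your data, not something shown in the sample.

```python
from typing import Iterable, Iterator


def strip_headers(lines: Iterable[str],
                  start_marker: str = 'WARC/1.0',
                  end_prefix: str = 'Content-Type: text/html') -> Iterator[str]:
    """Yield only the lines that fall outside a header block.

    A header block runs from a line equal to start_marker up to and
    including the next line that starts with end_prefix (assumed to be
    the last header line, as in the sample from the question).
    """
    in_header = False
    for line in lines:
        stripped = line.strip()
        if stripped == start_marker:
            in_header = True
        if not in_header:
            yield line
        # Checked after the yield so the closing header line is dropped too.
        if in_header and stripped.startswith(end_prefix):
            in_header = False


# Small made-up sample with two records, shaped like the one in the question.
sample = [
    'WARC/1.0\n',
    'WARC-Type: response\n',
    'Content-Type: application/http; msgtype=response\n',
    'HTTP/1.1 200 OK\n',
    'Content-Type: text/html; charset=UTF-8\n',
    '<html>body text</html>\n',
    'WARC/1.0\n',
    'Content-Type: text/html\n',
    'second document\n',
]
cleaned = list(strip_headers(sample))
print(cleaned)
```

To use it on real files you could pass the open input file as `lines` and write the result with `outfile.writelines(strip_headers(infile))`. Note that if a response has a different content type (for example `text/plain`), neither version will find the end of the header block, so the markers may need adjusting for your dataset.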