Question

I am processing HTML pages with BeautifulSoup4. Each HTML file has request header information at the top. How can I filter it out?

Here is an excerpt from one of the HTML files:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-17T03:07:46Z
WARC-TREC-ID: clueweb12-0206wb-51-29582
WARC-Record-ID: <urn:uuid:546b127c-040e-4dee-a565-3a3f6683f898>
Content-Type: application/http; msgtype=response
Content-Length: 29032

HTTP/1.1 200 OK
Cache-Control: private
Connection: close
Date: Fri, 17 Feb 2012 03:07:48 GMT
Content-Length: 28332
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Set-Cookie:         chkvalues=ClmZLoF4xnHoBwiZnWFzYcCMoYB/fMxYfeeJl/zhlypgwivOzw6qnVBRWzf8f19O; expires=Wed, 15-Aug-2012 02:07:48 GMT; path=/
Set-Cookie: previous-category-id=11; expires=Fri, 17-Feb-2012 03:27:48
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" >
<head id="ctl00_headTag"><title>

I want to extract the text between <html> and </html>. This is what I tried:

import codecs
from pprint import pprint
from bs4 import BeautifulSoup

with codecs.open(file, 'r', 'utf-8', errors='ignore') as f:
    contents = f.read()
soup = BeautifulSoup(contents, "lxml")
for script in soup.find_all(["script", "style"]):  # remove script and style tags
    script.extract()
try:
    raw_text = soup.find('html').text.lower()
except AttributeError:
    pprint('{0} file is empty'.format(file))

raw_text ends up filled with content like

"WARC/1.0\r\nWARC-Type: response\r\nWARC-Date: 2012-02-17T03:07:46Z...."

which means the header is being included in raw_text.

How can I remove this header content from the raw text?

Answer 1

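HTTP headers are separated from the body by two line breaks, so you could split the data on \r\n\r\n. However, your file contains more than one header block (the WARC record headers followed by the HTTP response headers), so it is easier to use the start of the body as the delimiter.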

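# Slice from the start of the markup, dropping everything before it.
try:
    contents = contents[contents.index('<!DOCTYPE'):]
except ValueError:
    # No DOCTYPE declaration: fall back to the opening <html tag.
    contents = contents[contents.index('<html'):]
soup = BeautifulSoup(contents, "lxml")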

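Some HTML documents may not have a DOCTYPE declaration. That is why the index lookup is wrapped in try/except: when '<!DOCTYPE' is not found, the slice starts at '<html' instead.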
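
The answer mentions splitting on \r\n\r\n but shows no code for it. Below is a minimal sketch of that alternative, assuming each file holds a single WARC response record in which both the WARC headers and the HTTP headers end with a blank line (\r\n\r\n). The function name body_after_headers and the header_blocks parameter are hypothetical, chosen only for illustration.

```python
def body_after_headers(contents, header_blocks=2):
    """Drop the first `header_blocks` blank-line-terminated header blocks."""
    # maxsplit keeps any later blank lines inside the HTML body intact.
    parts = contents.split("\r\n\r\n", header_blocks)
    if len(parts) <= header_blocks:
        # Fewer blank lines than expected: leave the text untouched.
        return contents
    return parts[-1]


# Made-up record shaped like the excerpt in the question.
record = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "Content-Length: 29032\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<!DOCTYPE html>\r\n<html><body>\r\n\r\nhello</body></html>"
)
html = body_after_headers(record)
print(html)
```

The result can then be passed to BeautifulSoup(html, "lxml") as in the question. This approach depends on the number of header blocks being known in advance, which is the reason the marker-based slice is the simpler choice here.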

Answer 2

'\n'.join([e for e in raw_text.split('\n') if (e and e[0]=="<")])

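You can use this list comprehension to keep only the lines that start with <. Note that it also drops any line of body text that does not begin with a tag.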
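
To make the effect of the line filter concrete, here is a small sketch run on a made-up record. It is a variation on the one-liner rather than a copy of it: it uses splitlines() so that \r\n endings are handled, and lstrip() so that indented tags are kept. The function name keep_tag_lines and the sample text are assumptions for illustration.

```python
def keep_tag_lines(text):
    """Keep only the lines whose first non-blank character is '<'."""
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith("<"):
            kept.append(line)
    return "\n".join(kept)


# Made-up record: two header blocks followed by a small HTML body.
sample = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html>\r\n"
    "  <body>\r\n"
    "<p>first line\r\n"
    "continues here</p>\r\n"
    "  </body>\r\n"
    "</html>\r\n"
)
filtered = keep_tag_lines(sample)
print(filtered)
```

The header lines are gone, but so is the line "continues here</p>", because it does not begin with a tag. If the page text matters more than the markup, slicing from the start of the body as in Answer 1 is the safer option.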
