在请求具有请求的URL时规范化HTML内容

时间:2017-03-25 18:36:38

标签: python html python-3.x python-requests

我正在使用Python 3.6中的请求来获取HTML内容,使用以下代码:

import requests
url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)

但是,输出的内容很奇怪,有许多“\ n”字符:

 b'<!DOCTYPE html>\n<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle"  itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->\n<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->\n<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> 
 <![endif]-->\n<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]
 -->\n<head>\n ..."

如何修复它以获得完整的标准HTML输出?

1 个答案:

答案 0 :(得分:0)

使用response.text代替response.content - 如下面引用的请求文档中所述,这将使用HTTP响应提供的编码信息将响应内容解码为Unicode字符串:

  

content

     

响应内容,以字节为单位。

  

text

     

响应的内容,以unicode为单位。

     

如果Response.encoding为None,则使用chardet猜测编码。

     

响应内容的编码仅根据HTTP标头确定,遵循RFC 2616到字母。如果您可以利用非HTTP知识来更好地猜测编码,则应在访问此属性之前适当地设置r.encoding。

示例:

import requests
url = 'https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.text)

输出:

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--> <html lang="en" class="no-js section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" itemid="https://www.nytimes.com/2017/03/17/world/europe/trump-britain-obama-wiretap-gchq.html" itemtype="http://schema.org/NewsArticle"  itemscope xmlns:og="http://opengraphprotocol.org/schema/"> <!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 section-europe format-medium tone-news app-article page-theme-standard has-comments has-top-ad type-size-small has-large-lede" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<head>
    <title>Trump Offers No Apology for Claim on British Spying - The New York Times</title>
      <!-- etc ... -->
</body>
</html>