Question

from bs4 import BeautifulSoup
import requests
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)

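I want to get at the contents (the p elements) of the div with class="content", but when I print r.text, a large part of the page is missing. However, I also found that if I open a text file and write r.text to it, the full content is there when I open the file in Notepad.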

doc = open("file104.txt", "w", encoding="utf-8")
doc.write(r.text)
doc.close()

I thought it might be an encoding problem, but after setting the encoding to utf-8 it still doesn't work.

Sorry to bother you, everyone!

==========================================================================

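I finally found out that the problem comes from IPython / IDLE. If I run the code in PowerShell, everything works fine. I should have tried that earlier...

But I would still like to know what causes this problem!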
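
One way to narrow down a symptom like this, assuming the download is complete and only the display differs between shells, is to inspect the string without relying on how the console renders it. The sketch below is not from the original post: the helper names `summarize` and `console_safe` and the sample string are hypothetical, and the idea that some characters cannot be shown by a given console is a possible cause, not a confirmed one.

```python
import sys

def summarize(text):
    """Return facts about a string that do not depend on console rendering."""
    return {
        "length": len(text),
        "has_target": 'class="content"' in text,
        # Characters outside the Basic Multilingual Plane (e.g. emoji)
        "non_bmp": sum(1 for ch in text if ord(ch) > 0xFFFF),
    }

def console_safe(text, encoding=None):
    """Replace characters the given encoding cannot represent with '?'."""
    encoding = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(encoding, errors="replace").decode(encoding)

# Made-up stand-in for r.text
sample = '<div class="content"><p>工作內容 \U0001F600</p></div>'
print(summarize(sample))
print(console_safe(sample, encoding="ascii"))
```

If `length` and `has_target` look right while the printed output is still incomplete, the text itself is intact and the shell's display is the more likely culprit.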

Answer 1

Use content.decode():

    >>> import requests
    >>> url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
    >>> r = requests.get(url)
    >>> TextInfo = r.content.decode('UTF-8')
    >>> print(TextInfo)
    <!DOCTYPE html>
    <!--[if lt IE 7]>     <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
    <!--[if IE 7]>        <html class="lt-ie9 lt-ie8"> <![endif]-->
    <!--[if IE 8]>        <html class="lt-ie9"> <![endif]-->
    <!--[if gt IE 8]><!--><html lang="zh-tw"><!--<![endif]-->
    <head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge" />
    <meta http-equiv="pragma" content="no-cache" />
    <meta http-equiv="cache-control" content="no-cache" />

.....
.....

the guts of the html code

.....
.....

    </script>
    </body>
    </html>

    >>>
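
With the decoded text in hand, the paragraphs inside the div with class="content" that the question asks about can be extracted from the string itself. This is a minimal sketch that uses the standard library's `html.parser` as a stand-in for the BeautifulSoup import in the question. The class name `ContentParagraphs` and the sample markup are made up for illustration, and the real page's structure may differ.

```python
from html.parser import HTMLParser

class ContentParagraphs(HTMLParser):
    """Collect the text of <p> elements found inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the target div (nested divs counted)
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            # Enter the target div, or track a div nested inside it
            if self.div_depth or "content" in classes:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Made-up stand-in for the decoded page text
sample = ('<div class="other"><p>skip</p></div>'
          '<div class="content"><p>工作內容</p><div><p>nested</p></div></div>')
parser = ContentParagraphs()
parser.feed(sample)
print(len(parser.paragraphs))
```

In a real session the string passed to `feed` would be the decoded page text, such as `TextInfo` from the session above.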

Answer 2

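from bs4 import BeautifulSoup
import urllib.request

url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"

# Option 1: read the raw bytes and decode them explicitly
r = urllib.request.urlopen(url).read()
r = r.decode('utf-8')
print(r)

# OR

# Option 2: save the page to disk, then read it back and decode it
urllib.request.urlretrieve(url, "myhtml.html")
with open("myhtml.html", "rb") as myhtml:
    print(myhtml.read().decode('utf-8'))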
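
Both variants above hard-code UTF-8. As a hedged alternative, the charset can be read from the response's Content-Type header, falling back to UTF-8 when none is declared. The helper name `decode_body` is hypothetical, and the header object is built by hand here so the sketch runs offline. With urllib, the equivalent object should be available as the `headers` attribute of the response.

```python
from email.message import Message

def decode_body(raw, headers, default="utf-8"):
    """Decode raw bytes using the charset declared in the headers, if any."""
    charset = headers.get_content_charset() or default
    # errors="replace" keeps going even if a few bytes are malformed
    return raw.decode(charset, errors="replace")

# Hand-built stand-ins for a response's headers and body
headers = Message()
headers["Content-Type"] = "text/html; charset=UTF-8"
raw = "<p>工作內容</p>".encode("utf-8")

text = decode_body(raw, headers)
print(text == "<p>工作內容</p>")
```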

request.get.text error in Python 3.6, really need some help
