Is it safe to add "www" to any domain name?

Asked: 2017-08-25 00:43:25

Tags: python web pyqt

I modified the web-scraping code from https://impythonist.wordpress.com/2015/01/06/ultimate-guide-for-scraping-javascript-rendered-web-pages/.

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage
from sys import argv
from bs4 import BeautifulSoup

# Use result of rendering.
class Render(QWebPage):
  def __init__(self, url):
    self.app = QApplication(argv)
    QWebPage.__init__(self)
    self.loadFinished.connect(self._loadFinished)
    self.mainFrame().load(QUrl(url))
    self.app.exec_()

  def _loadFinished(self, result):
    self.frame = self.mainFrame()
    self.app.quit()

r = Render(url)
result = unicode(r.frame.toHtml().toUtf8(), encoding="UTF-8")

soup = BeautifulSoup(result, 'html.parser')

for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text().encode("utf-8")

With this code, I found that "nate.com" returns no text, but "www.nate.com" does. So I am trying to add "www" to every domain.

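  1. Are there any websites where I should not add "www" to the domain?
  2. (Like this:)

    # Normalize to exactly one leading "www." (assumes url is a bare host name)
    if url.startswith("www."):
        url = url[len("www."):]
    url = "www." + url

  3. (Optional) Why does "nate.com" return no text while "www.nate.com" does? I noticed that Chrome redirects it to "www.nate.com".
  4. Any comments are welcome.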

1 answer:

Answer 0 (score: 1)

  

Are there any websites where I should not add "www" to the domain?

Yes. For example, huji.ac.il:

$ http http://huji.ac.il
HTTP/1.1 200 OK
Accept-Ranges: bytes
Age: 94
Cache-Control: max-age=300
Connection: Keep-Alive
Content-Length: 173
Content-Type: text/html
Date: Fri, 25 Aug 2017 01:16:23 GMT
Expires: Fri, 25 Aug 2017 01:19:49 GMT
Server: Apache/2.2.15 (Red Hat)

<HTML>
<HEAD>
<meta http-equiv="refresh" content="0; URL=http://new.huji.ac.il">
</HEAD>
<BODY>
<a href="http://new.huji.ac.il">click here</a> jumping ....
</BODY>
</HTML>

OK, now let's try www.huji.ac.il:

$ http http://www.huji.ac.il
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=300
Connection: close
Content-Length: 173
Content-Type: text/html
Date: Fri, 25 Aug 2017 01:16:31 GMT
Expires: Fri, 25 Aug 2017 01:21:31 GMT
Server: Apache/2.2.15 (Red Hat)

<HTML>
<HEAD>
<meta http-equiv="refresh" content="0; URL=http://new.huji.ac.il">
</HEAD>
<BODY>
<a href="http://new.huji.ac.il">click here</a> jumping ....
</BODY>
</HTML>

Either way, the page redirects to new.huji.ac.il (via a meta refresh tag). Let's try that host with www:

$ http http://www.new.huji.ac.il

http: error: ConnectionError: HTTPConnectionPool(host='www.new.huji.ac.il', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f54f57fdd90>: Failed to establish a new connection: [Errno -2] Name or service not known',)) while doing GET request to URL: http://www.new.huji.ac.il/

So in this example, adding www causes an error: the name www.new.huji.ac.il does not resolve in DNS.
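One way to avoid this failure is to check that the "www." name resolves before using it. The following is a minimal sketch in Python 3 (not the asker's Python 2 / PyQt4 setup); the names `resolves` and `with_www_if_resolvable` are hypothetical helpers introduced here, and the resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket


def resolves(host):
    """Return True if DNS can resolve `host` (requires network access)."""
    try:
        socket.getaddrinfo(host, 80)
        return True
    except socket.gaierror:
        return False


def with_www_if_resolvable(host, resolver=resolves):
    """Prefer 'www.<host>' only when that name actually resolves.

    `host` is assumed to be a bare host name without scheme or path.
    """
    if host.startswith("www."):
        return host
    candidate = "www." + host
    return candidate if resolver(candidate) else host
```

A name that resolves is not guaranteed to serve the same content as the bare domain, so this only guards against the "Name or service not known" failure shown above.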

  

Why does "nate.com" return no text while "www.nate.com" does? I noticed that Chrome redirects it to "www.nate.com".

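Because "nate.com" redirects using JavaScript rather than an HTTP redirect status. The response is a 200 whose body contains only a script: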

$ http http://nate.com
HTTP/1.1 200 OK
Cache-Control: no-store, no-cache, must-revalidate
Connection: close
Content-Encoding: gzip
Content-Language: ko
Content-Length: 88
Content-Type: text/html; charset=utf-8
Date: Fri, 25 Aug 2017 01:13:34 GMT
Pragma: no-cache
Server: Apache
Vary: Accept-Encoding

<script type='text/javascript'>location.href='http://www.nate.com';</script>

As pointed out in the comments: you should add a feature to your code that follows redirects.
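Both redirects shown above happen on the client side (a meta refresh for huji.ac.il, a `location.href` assignment for nate.com), so an HTTP client that only follows 3xx responses will not see them. The following is a rough sketch in Python 3 of how such redirects could be detected in the fetched HTML. `find_client_redirect` is a hypothetical helper, and the regular expressions are heuristics that cover only the two patterns seen in this answer, not a general HTML or JavaScript parser.

```python
import re
from urllib.parse import urljoin

# Matches e.g. <meta http-equiv="refresh" content="0; URL=http://example.org">
META_REFRESH = re.compile(
    r"""<meta[^>]*http-equiv=["']?refresh["']?[^>]*content=["'][^"']*?url=([^"'>\s]+)""",
    re.IGNORECASE,
)

# Matches e.g. location.href='http://example.org' or location = "..."
JS_LOCATION = re.compile(
    r"""location(?:\.href)?\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def find_client_redirect(html, base_url):
    """Return the absolute redirect target found in `html`, or None."""
    for pattern in (META_REFRESH, JS_LOCATION):
        match = pattern.search(html)
        if match:
            # Resolve relative targets against the page that was fetched.
            return urljoin(base_url, match.group(1))
    return None
```

A scraper could call this on each response and load the returned URL again, with a cap on the number of hops to avoid redirect loops.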