Why does a website's response to Python's `urllib.request` differ from the response to a request sent directly from a web browser?

Time: 2014-02-22 18:11:41

Tags: python web browser urllib

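I have a program that takes a URL and gets a response from the server using urllib.request. It all works fine, but I tested it a little more and realised that when I enter a URL such as http://google.com into my browser, I get a different page (with a doodle, a science fair promotion and so on), whereas my program gets just plain Google with nothing special.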

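It is probably due to redirection, but if the request from my program goes through the same router and DNS, then surely the output should be exactly the same?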

Here is the code:

"""
This is a simple browsing widget that handles user requests, with the
added condition that all proxy settings are ignored. It outputs in the
default web browser.
"""

# This imports some necessary libraries.
import tkinter as tk
import webbrowser

from tempfile import NamedTemporaryFile
import urllib.request


def parse(data):
    """
    Removes junk from the data so it can be easily processed.
    :rtype : list
    :param data: The raw response body, as UTF-8 encoded bytes.
    """
    data = data.decode(encoding='UTF-8')  # This makes data workable.
    lines = data.splitlines()  # This clarifies the lines for writing.
    return lines


class Browser(object):
    """This creates an object for getting a direct server response."""
    def __init__(self, master):
        """
        Sets up a direct browsing session and a GUI to manipulate it.
        :param master: Any Tk() window in which the GUI is displayable.
        """
        # This creates a frame within which widgets can be stored.
        frame = tk.Frame(master)
        frame.pack()

        # Here we create a handler that ignores proxies. An empty dict is
        # needed: proxies=None falls back to the system proxy settings.
        proxy_handler = urllib.request.ProxyHandler(proxies={})
        self.opener = urllib.request.build_opener(proxy_handler)

        # This sets up components for the GUI.
        tk.Label(frame, text='Full Path').grid(row=0)
        self.url = tk.Entry(frame)  # This takes the specified path.
        self.url.grid(row=0, column=1)
        tk.Button(frame, text='Go', command=self.browse).grid(row=0, column=2)

        # This binds the return key to calling the method self.browse.
        master.bind('<Return>', self.browse)

    def navigate(self, query):
        """
        Gets raw data from the queried server, ready to be processed.
        :rtype : bytes
        :param query: The request entered into 'self.url'.
        """
        # This contacts the domain and reads its response.
        response = self.opener.open(query)
        html = response.read()
        return html

    def browse(self, event=None):
        """
        Wraps all functionality together for data reading and writing.
        :param event: The argument from whatever calls the method.
        """
        # This retrieves the input given by the user.
        location = self.url.get()
        print('\nUser inputted:', location)
        # This attempts to access the server and gives any errors.
        try:
            raw_data = self.navigate(location)
        except Exception as e:
            print(e)
        # This executes assuming there are no errors.
        else:
            clean_data = parse(raw_data)
            # This writes a temporary HTML file, restoring the newlines that
            # splitlines() removed, and opens it once it has been closed.
            with NamedTemporaryFile(suffix='.html', delete=False) as cache:
                cache.writelines(
                    (line + '\n').encode('UTF-8') for line in clean_data)
            webbrowser.open_new_tab(cache.name)
            print('Done.')


def main():
    """Using a main function means not doing everything globally."""
    # This creates a window that is always in the foreground.
    root = tk.Tk()
    root.wm_attributes('-topmost', 1)
    root.title('DirectQuery')

    # This starts the program.
    Browser(root)
    root.mainloop()

# This allows for execution as well as for importing.
if __name__ == '__main__':
    main()

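Note: I don't know whether this has anything to do with the fact that the program is told to ignore proxies? My computer has no proxy settings turned on. Also, if there is a way to get the same response/output as a web browser such as Chrome, I would love to hear it.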

1 Answer:

Answer 0 (score: 3)

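To answer your general question, you need to understand how the website in question operates, so this isn't really a Python question. Websites frequently detect the browser's "make and model" with special detection code, often (as indicated in the comments on the question) starting with the User-Agent HTTP header.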
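To see what a server has to go on, it can help to look at the header urllib sends when nothing is overridden. The snippet below is a minimal sketch; `default_user_agent` is a hypothetical helper name, not part of the question's code. It only inspects a freshly built opener and makes no network request.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value a freshly built opener sends by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request.
    headers = {name.lower(): value for name, value in opener.addheaders}
    return headers.get('user-agent')


if __name__ == '__main__':
    # Prints a value of the form 'Python-urllib/<major>.<minor>'.
    print(default_user_agent())
```

A value beginning with `Python-urllib/` tells the server plainly that the client is a script rather than a browser.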

If the user agent identifies itself as a program, Google's home page will not include any of its JavaScript-based features.
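One way to experiment with this is to have the opener send a browser-like User-Agent. The following is a rough sketch under stated assumptions: `build_browser_like_opener`, `fetch` and `EXAMPLE_USER_AGENT` are hypothetical names, and the header value is an illustrative placeholder rather than any real browser's exact string. Nothing here guarantees the same page a browser shows, since sites may also look at cookies and other headers, and urllib does not execute JavaScript.

```python
import urllib.request

# Assumption: an illustrative placeholder, not a real browser's exact string.
EXAMPLE_USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'


def build_browser_like_opener(user_agent=EXAMPLE_USER_AGENT):
    """Build an opener that ignores proxies and sends a custom User-Agent."""
    # An empty dict disables proxies; None would use the system settings.
    proxy_handler = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default ('User-agent', 'Python-urllib/x.y') entry.
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


def fetch(opener, url):
    """Return the raw response body for url as bytes (needs network access)."""
    with opener.open(url) as response:
        return response.read()
```

In the question's `Browser` class, `self.opener` could be created with `build_browser_like_opener()`; calling `fetch(opener, 'http://google.com')` then returns the bytes to pass on to `parse`.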