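I have a program that takes a URL and uses urllib.request to fetch the response from the server. That all works fine, but I tested it a bit and realised that when I enter a URL such as http://google.com in my browser, I get a different page (with a doodle, a science fair promotion, and so on), whereas my program returns just plain Google with nothing special.
This may be due to a redirect, but if the request from my program goes through the same router and DNS, shouldn't the output be exactly the same?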
Here is the code:
"""
This is a simple browsing widget that handles user requests, with the
added condition that all proxy settings are ignored. It outputs in the
default web browser.
"""
# This imports some necessary libraries.
import tkinter as tk
import urllib.request
import webbrowser
from tempfile import NamedTemporaryFile


def parse(data):
    """
    Removes junk from the data so it can be easily processed.
    :rtype : list
    :param data: The raw HTML response as bytes.
    """
    data = data.decode(encoding='UTF-8')  # This makes data workable.
    # Keeping the line endings preserves the page layout when writing.
    lines = data.splitlines(keepends=True)
    return lines


class Browser(object):
    """This creates an object for getting a direct server response."""

    def __init__(self, master):
        """
        Sets up a direct browsing session and a GUI to manipulate it.
        :param master: Any Tk() window in which the GUI is displayable.
        """
        # This creates a frame within which widgets can be stored.
        frame = tk.Frame(master)
        frame.pack()
        # Here we create a handler that ignores proxies. An empty dict is
        # needed for that: proxies=None reads them from the environment.
        proxy_handler = urllib.request.ProxyHandler({})
        self.opener = urllib.request.build_opener(proxy_handler)
        # This sets up components for the GUI.
        tk.Label(frame, text='Full Path').grid(row=0)
        self.url = tk.Entry(frame)  # This takes the specified path.
        self.url.grid(row=0, column=1)
        tk.Button(frame, text='Go', command=self.browse).grid(row=0, column=2)
        # This binds the return key to calling the method self.browse.
        master.bind('<Return>', self.browse)

    def navigate(self, query):
        """
        Gets raw data from the queried server, ready to be processed.
        :rtype : bytes
        :param query: The request entered into 'self.url'.
        """
        # This contacts the domain and reads its response.
        response = self.opener.open(query)
        html = response.read()
        return html

    def browse(self, event=None):
        """
        Wraps all functionality together for data reading and writing.
        :param event: The argument from whatever calls the method.
        """
        # This retrieves the input given by the user.
        location = self.url.get()
        print('\nUser inputted:', location)
        # This attempts to access the server and gives any errors.
        try:
            raw_data = self.navigate(location)
        except Exception as e:
            print(e)
        # This executes assuming there are no errors.
        else:
            clean_data = parse(raw_data)
            # This creates a temporary HTML file and writes the page to it.
            with NamedTemporaryFile(suffix='.html', delete=False) as cache:
                cache.writelines(line.encode('UTF-8') for line in clean_data)
            # The file is closed (and flushed) before the browser opens it.
            webbrowser.open_new_tab(cache.name)
            print('Done.')


def main():
    """Using a main function means not doing everything globally."""
    # This creates a window that is always in the foreground.
    root = tk.Tk()
    root.wm_attributes('-topmost', 1)
    root.title('DirectQuery')
    # This starts the program.
    Browser(root)
    root.mainloop()


# This allows for execution as well as for importing.
if __name__ == '__main__':
    main()
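Note: I don't know whether this has anything to do with the fact that the program is told to ignore proxies. My computer doesn't have any proxy settings turned on. Also, if there is a way to get the same response/output as a web browser such as Chrome, I would love to hear it.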
Answer 0 (score: 3)
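To answer your general question you need to understand how the website in question operates, so this isn't really a Python question. Websites frequently detect the browser's "make and model" with special detection code, often (as indicated in the comments on the question) starting with the User-Agent: HTTP header.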
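As a quick check of what a urllib-based program announces by default, the snippet below inspects the headers that a freshly built opener sends with every request. It is only a sketch; the variable name `default_headers` is mine, and the exact version suffix depends on your Python installation.

```python
import urllib.request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers that are attached to every request it makes.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The User-agent value starts with "Python-urllib/", which tells the
# server that the client is a script rather than a browser.
print(default_headers)
```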
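It would therefore make sense for Google's home page not to include any JavaScript-based functionality if the User-Agent identifies itself as a program.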
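If you want to experiment with this, one option is to make the opener send a browser-like User-Agent. The following is a minimal sketch, not a guaranteed fix: the helper names `build_direct_opener` and `fetch` are hypothetical, and the User-Agent string is only an example value that you should replace with the one your own browser sends. A site may also look at cookies, other headers, or JavaScript execution, so the response can still differ from what Chrome shows.

```python
import urllib.request

# Example value only (an assumption); copy the real string from your
# own browser, for instance from its developer tools.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)


def build_direct_opener(user_agent=BROWSER_USER_AGENT):
    """Return an opener that ignores proxies and sends `user_agent`."""
    # An empty dict disables proxy autodetection from the environment.
    proxy_handler = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(proxy_handler)
    # Replace the default "Python-urllib/x.y" header with our own.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch(url, opener=None):
    """Return (final_url, body_bytes) for `url`, following redirects."""
    opener = opener or build_direct_opener()
    with opener.open(url) as response:
        # geturl() reports the URL actually retrieved, so a redirect
        # shows up as a difference from the URL that was requested.
        return response.geturl(), response.read()
```

In the question's code, `self.opener = build_direct_opener()` could stand in for the two opener lines in `__init__`, and comparing the first element returned by `fetch('http://google.com')` with the URL you typed shows whether a redirect took place.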