Question

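In proxy server code, while relaying HTTP traffic between the browser and websites, I would like to distinguish a "main" or "top-level" URL (the one shown in the browser's address bar) from the embedded pages, frames, banners and so on that are included in that page.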

I tried using the Content-Type header, as shown below (code in Python):

import http.server
import logging
import urllib.request

logger = logging.getLogger(__name__)

class Proxy(http.server.SimpleHTTPRequestHandler):
    def do_GET(self):
        # For a proxied request, self.path holds the absolute URL.
        res = urllib.request.urlopen(self.path)
        cnttype = res.headers.get("Content-Type", "")
        # Log only responses that look like pages (HTML or plain text).
        if cnttype.startswith(("text/html", "text/plain")):
            logger.debug(self.requestline)
        self.copyfile(res, self.wfile)

However, looking at the content type of the URL does not help, because an embedded frame can also have the type text/html.

Is there some metadata, or perhaps a reliable heuristic, for identifying "main" pages and telling them apart from the URLs of the content they include?
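One direction that might be worth exploring is the request headers rather than the response headers. The sketch below is only an assumption-laden illustration, not a confirmed solution: the helper name looks_like_top_level is hypothetical, and it relies on the browser sending the Fetch Metadata header Sec-Fetch-Dest, which browsers attach only to requests for potentially trustworthy (for example HTTPS) URLs, so a plain-HTTP proxy may never see it. The fallback on the Accept header is a weaker guess: it filters out images, scripts and stylesheets, but it cannot tell a frame from a top-level page.

```python
def looks_like_top_level(headers):
    """Guess whether a request is a top-level navigation.

    `headers` is any mapping with a .get() method, for example
    self.headers inside a BaseHTTPRequestHandler (hypothetical usage).
    """
    dest = headers.get("Sec-Fetch-Dest")
    if dest is not None:
        # Fetch Metadata: "document" marks a top-level navigation, while
        # "iframe", "frame", "image", "script", ... mark embedded content.
        return dest == "document"
    # Fallback (assumption): navigations list text/html first in Accept.
    # Frames do the same, so this only excludes non-HTML subresources.
    accept = headers.get("Accept", "")
    return accept.lstrip().startswith("text/html")
```

Inside do_GET this could be called as looks_like_top_level(self.headers) before deciding whether to log the request line.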

How can I distinguish a normal web page from an embedded one in an HTTP request?

0 answers: