Question

    import urllib

    #my url here stored as url

    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    print(htmltext)

I am trying to get the source code of a page from its URL.

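I do get source code back, but it comes from a different page that says two things: "Please enable cookies" and "This domain has banned your access based on your browser's signature".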

Does anyone know a way to get the source code when the site can tell that you are not actually visiting the page in a browser?

Answer 1

You probably need to set up a URL opener:

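    # Python 2 standard library modules
    import cookielib
    import urllib2

    def createOpener(self):
        # Route all requests through a cookie-aware opener.
        handlers = []
        cj = MyCookieJar()
        cj.set_policy(cookielib.DefaultCookiePolicy(rfc2965=True))
        cjhdr = urllib2.HTTPCookieProcessor(cj)
        handlers.append(cjhdr)
        opener = urllib2.build_opener(*handlers)
        # Send a browser-like User-Agent instead of the Python default.
        # 'Host' should name the site being requested.
        opener.addheaders = [('User-Agent', self.getUserAgent()),
                             ('Host', 'google.com')]
        return opener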
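
For readers on Python 3, the same pieces live in `urllib.request` and `http.cookiejar`. The following is only a rough sketch of an equivalent opener, assuming a plain `CookieJar` is enough for the target site. The names `create_opener` and `DEFAULT_USER_AGENT` are made up for illustration, and the User-Agent value is a placeholder.

```python
import http.cookiejar
import urllib.request

# Hypothetical placeholder; substitute a real browser User-Agent string.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleFetcher/1.0"


def create_opener(user_agent=DEFAULT_USER_AGENT):
    """Return (opener, cookie_jar); the opener stores and resends cookies."""
    cookie_jar = http.cookiejar.CookieJar()
    cookie_jar.set_policy(http.cookiejar.DefaultCookiePolicy(rfc2965=True))
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, cookie_jar
```

Returning the jar alongside the opener makes it easy to inspect which cookies the site has set after a request.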

The cookie jar:

    class MyCookieJar(cookielib.CookieJar):
        def _cookie_from_cookie_tuple(self, tup, request):
            name, value, standard, rest = tup
            version = standard.get('version', None)
            if version is not None:
                # Strip stray quotes (e.g. '"1"') so the version parses as an integer.
                version = version.replace('"', '')
                standard["version"] = version
            return cookielib.CookieJar._cookie_from_cookie_tuple(self, tup, request)

At this point you create the opener and read the data from the URL handle, like this:

    def fetchURL(self, url, data=None, headers=None):
        # Avoid a mutable default argument; fall back to an empty dict.
        request = urllib2.Request(url, data, headers or {})
        self.opener = self.createOpener()
        urlHandle = self.opener.open(request)
        return urlHandle.read()
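
A Python 3 version of the fetch step might look roughly like the sketch below. It assumes the caller passes in an opener built with `urllib.request.build_opener`, and that falling back to UTF-8 is acceptable when the response declares no charset. The function name `fetch_url` is hypothetical.

```python
import urllib.request


def fetch_url(opener, url, data=None, headers=None):
    """Open url through opener and return the body decoded as text."""
    request = urllib.request.Request(url, data=data, headers=headers or {})
    with opener.open(request) as response:
        # Assumption: fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

For a quick offline check, `fetch_url(urllib.request.build_opener(), "data:text/plain,hello")` goes through the same code path without touching the network.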

It is a good idea to keep a list of User-Agent strings and read from it:

    # ffpath: path to a text file with one User-Agent string per line
    with open(ffpath) as f:
        USER_AGENTS_LIST = f.read().splitlines()

Pick a random one from it:

    import random

    uA = random.choice(USER_AGENTS_LIST)
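
One way to wire the random pick into each request is to build a fresh headers dict per call. This is only a sketch for Python 3: the helper names `load_user_agents` and `random_user_agent_headers` are invented here, and the file format (one User-Agent string per line) is an assumption.

```python
import random


def load_user_agents(path):
    """Read one User-Agent string per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def random_user_agent_headers(user_agents):
    """Return request headers carrying one randomly chosen User-Agent."""
    # random.choice raises IndexError if the list is empty.
    return {"User-Agent": random.choice(user_agents)}
```

The resulting dict can be passed as the `headers` argument when building a request, so each fetch may present a different User-Agent.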

To get a list of User-Agent strings, see here.

This is just a way to do it without any external frameworks.

Getting HTML source code with Python using cookies
