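Python cannot get the source code through a timed ad overlay

Time: 2017-07-24 04:47:55

Tags: python html cookies rocketscript

I'm not a coder; all I need to do is get the complete source code of a page. I found this code a while ago and it has always worked well. But because of ad overlays that run on a timer, it fails on some websites.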

This is the script I run from the Python 2.7 console:

    import urllib2

    site = "http://example.com"  # real URL edited out

    # Browser-like headers so the request looks like a normal visit.
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

    req = urllib2.Request(site, headers=hdr)

    try:
        page = urllib2.urlopen(req)
        content = page.read()
    except urllib2.HTTPError as e:
        # On an HTTP error, use the body of the error response instead.
        content = e.read()

    print content

1 answer:

Answer 0 (score: 0)

What I did was turn it into a function, and it works!

    import urllib2


    def getHtml(url):
        """Fetch url with browser-like headers and return the response body."""
        hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
               'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
               'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
               'Accept-Encoding': 'none',
               'Accept-Language': 'en-US,en;q=0.8',
               'Connection': 'keep-alive',
               'Content-Type': 'application/x-www-form-urlencoded'}

        req = urllib2.Request(url, headers=hdr)

        try:
            page = urllib2.urlopen(req)
            html = page.read()
        except urllib2.HTTPError as e:
            # On an HTTP error, return the body of the error response.
            html = e.read()

        return html
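
For anyone on Python 3, where `urllib2` no longer exists, roughly the same function can be sketched with `urllib.request`. This is only a sketch under assumptions, not the code I actually ran: the names `HEADERS`, `build_request`, and `get_html` are made up for illustration, the header set is trimmed, and the UTF-8 fallback is a guess for pages that do not declare a charset.

```python
import urllib.error
import urllib.request

# Browser-like headers, trimmed from the Python 2 version (illustrative subset).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 "
                  "(KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def build_request(url):
    """Return a Request for url that carries the browser-like headers."""
    return urllib.request.Request(url, headers=HEADERS)


def get_html(url, timeout=30):
    """Fetch url and return its body as text (hypothetical helper)."""
    request = build_request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Assumption: fall back to UTF-8 when no charset is declared.
            charset = response.headers.get_content_charset() or "utf-8"
    except urllib.error.HTTPError as error:
        # On an HTTP error, return the body of the error response.
        raw = error.read()
        charset = error.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling `get_html("http://example.com")` returns the page as a `str`. Like the Python 2 version, it only sees the HTML the server sends and does not run any JavaScript.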

Alternative (slow): while searching online, I found that you can use Python Selenium WebDriver with Firefox, Chrome, or headless PhantomJS to get the HTML source. You need to place GeckoDriver.exe, ChromeDriver.exe, or PhantomJS.exe in C:\Python27\Scripts\

    from selenium import webdriver


    def getHtmlViaWebDriver(url):
        """Load url in an external (headless) browser and return the rendered HTML."""
        # Pick one driver: PhantomJS (headless), Firefox, or Chrome.
        #driver = webdriver.Firefox(executable_path=r'C:\Python27\Scripts\geckodriver.exe')
        #driver = webdriver.Chrome(executable_path=r'C:\Python27\Scripts\chromedriver.exe')
        driver = webdriver.PhantomJS(executable_path=r'C:\Python27\Scripts\phantomJS.exe')
        try:
            driver.get(url)  # load the page before reading its source
            html = driver.page_source.encode('utf-8')
        finally:
            driver.quit()  # always close the browser, even if loading fails
        return html
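
With a timed ad overlay, the real content may only show up some time after the page loads, so reading `page_source` once may still return the overlay. A rough sketch of polling until the content appears is below. It assumes you know some string that occurs only in the real page; `get_html_after_wait`, `marker`, `timeout`, and `poll_interval` are hypothetical names for illustration, and `driver` can be any object with a `get(url)` method and a `page_source` attribute, such as a Selenium WebDriver.

```python
import time


def get_html_after_wait(driver, url, marker, timeout=30.0, poll_interval=0.5):
    """Load url, then poll driver.page_source until marker appears.

    marker is an assumed string that occurs only in the real content,
    not in the ad overlay. Raises TimeoutError if it never shows up.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "marker %r not found within %s seconds" % (marker, timeout)
            )
        # Give the timer on the page a chance to finish before checking again.
        time.sleep(poll_interval)
```

Selenium also ships its own `WebDriverWait` helper for this kind of polling; the hand-written loop here is just to keep the idea visible.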