GET request works in Burp Suite but not in a Python script

Time: 2018-10-30 00:19:26

Tags: python cookies web-scraping burp

I am trying to scrape this web page: https://www.bizjournals.com/bizjournals/topic/mergers-and-acquisitions

To do this, I ran all of the requests through Burp Suite to capture the raw HTTP requests, and found that this GET returns the response data I want:

GET /bizjournals/topic/mergers-and-acquisitions HTTP/1.1
Host: www.bizjournals.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: _gada_id.37fd=42d27afa7b5d9d52.1539194129.8.1540854839.1540596813; _gada_ses.37fd=*; privAu=0; U=R7TBZKp5S9s7STboMrwVRw049388bd; bizj=YToxOntzOjM6IlVJTiI7czozMDoiUjdUQlpLcDVTOXM3U1Rib01yd1ZSdzA0OTM4OGJkIjt9%7C1539194126%7C7032491ea4fda1d2bd7f3cda3a4ae5f88dd3d132a2197a748809a91781089884; visid_incap_1008177=waeGk/2KQaidk/7H6do2t7Q8vlsAAAAAQUIPAAAAAAAbbMDj4s8FnJtt8m7/VcTJ; AMCV_653F60B351E568560A490D4D%40AdobeOrg=-330454231%7CMCIDTS%7C17834%7CMCMID%7C61913838891849233843857380508945284416%7CMCAAMLH-1541459638%7C6%7CMCAAMB-1541459638%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCCIDH%7C-979001332%7CMCOPTOUT-1540862038s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C3.1.2; _vwo_uuid_v2=D329C15B99F95E64CF23A2EC5AC8803A4|f5fc10c70802d9d4ffb4d97126b17e96; _vis_opt_s=2%7C; _vwo_uuid=D329C15B99F95E64CF23A2EC5AC8803A4; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241539194127%3A60.19717992%3A%3A%3A103_0%2C83_0; _ga=GA1.2.272521147.1539194129; _mkto_trk=id:673-UWY-229&token:_mch-bizjournals.com-1539194129257-86307; __idcontext=eyJkZXZpY2VJRCI6IjFCMjlISkl1dENXSFo2ZVZnSE5mblllNUtyMiIsImNvb2tpZUlEIjoiMUJPVlM5a2lja245M0d3SldPU0xHUkhBNGRIIn0%3D; __gads=ID=6df081cb1e53c3c2:T=1539194130:S=ALNI_MZ1kYzkPt8uOc0_19H5BClvb_E3hA; _sdsat_daysSinceLastVisit=More than 7 days; bounceClientVisit2080v=N4IgNgDiBcIBYBcEQM4FIDMBBNAmAYnvgO6kB0ARgJYBeAVgPYCuATgHYCGYKZAxgwFsi1es3Zd0BBAwhVeRAQFMWAc2UoAtBzYATLbwCOTKiioIqDNihAAaECxggQAXyA; _gid=GA1.2.10506591.1540854741; _fbp=fb.1.1540854745242.237098982; nlbi_1008177=WjWzcOvGYQ7vzlamyKF7DAAAAADGB7ODc0PZqBHCtInhVenh; incap_ses_982_1008177=g4ieMhtWcyZLgSiCQMSgDd2X11sAAAAAi1jzpH/BRLX/qhRrRo44TQ==; _vis_opt_test_cookie=1; AMCVS_653F60B351E568560A490D4D%40AdobeOrg=1
Connection: close
Upgrade-Insecure-Requests: 1

I sent this request to Repeater and it works fine. When I use it in a Python script, it also works. The code:

import re

import requests
from bs4 import BeautifulSoup


def getBizJournals():
    link= "https://www.bizjournals.com/bizjournals/topic/mergers-and-acquisitions"

    cookies = dict(cookies_are = '_gada_id.37fd=42d27afa7b5d9d52.1539194129.8.1540854839.1540596813; _gada_ses.37fd=*; privAu=0; U=R7TBZKp5S9s7STboMrwVRw049388bd; bizj=YToxOntzOjM6IlVJTiI7czozMDoiUjdUQlpLcDVTOXM3U1Rib01yd1ZSdzA0OTM4OGJkIjt9%7C1539194126%7C7032491ea4fda1d2bd7f3cda3a4ae5f88dd3d132a2197a748809a91781089884; visid_incap_1008177=waeGk/2KQaidk/7H6do2t7Q8vlsAAAAAQUIPAAAAAAAbbMDj4s8FnJtt8m7/VcTJ; AMCV_653F60B351E568560A490D4D%40AdobeOrg=-330454231%7CMCIDTS%7C17834%7CMCMID%7C61913838891849233843857380508945284416%7CMCAAMLH-1541459638%7C6%7CMCAAMB-1541459638%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCCIDH%7C-979001332%7CMCOPTOUT-1540862038s%7CNONE%7CMCAID%7CNONE%7CvVersion%7C3.1.2; _vwo_uuid_v2=D329C15B99F95E64CF23A2EC5AC8803A4|f5fc10c70802d9d4ffb4d97126b17e96; _vis_opt_s=2%7C; _vwo_uuid=D329C15B99F95E64CF23A2EC5AC8803A4; _vwo_ds=3%3Aa_0%2Ct_0%3A0%241539194127%3A60.19717992%3A%3A%3A103_0%2C83_0; _ga=GA1.2.272521147.1539194129; _mkto_trk=id:673-UWY-229&token:_mch-bizjournals.com-1539194129257-86307; __idcontext=eyJkZXZpY2VJRCI6IjFCMjlISkl1dENXSFo2ZVZnSE5mblllNUtyMiIsImNvb2tpZUlEIjoiMUJPVlM5a2lja245M0d3SldPU0xHUkhBNGRIIn0%3D; __gads=ID=6df081cb1e53c3c2:T=1539194130:S=ALNI_MZ1kYzkPt8uOc0_19H5BClvb_E3hA; _sdsat_daysSinceLastVisit=More than 7 days; bounceClientVisit2080v=N4IgNgDiBcIBYBcEQM4FIDMBBNAmAYnvgO6kB0ARgJYBeAVgPYCuATgHYCGYKZAxgwFsi1es3Zd0BBAwhVeRAQFMWAc2UoAtBzYATLbwCOTKiioIqDNihAAaECxggQAXyA; _gid=GA1.2.10506591.1540854741; _fbp=fb.1.1540854745242.237098982; nlbi_1008177=WjWzcOvGYQ7vzlamyKF7DAAAAADGB7ODc0PZqBHCtInhVenh; incap_ses_982_1008177=TdaUBA39XV1omiaCQMSgDTOU11sAAAAAk0Dz+FXoB4BCz1M+ijAWGw==; _vis_opt_test_cookie=1; AMCVS_653F60B351E568560A490D4D%40AdobeOrg=1')


    response = requests.get(link, cookies=cookies)
    soup = BeautifulSoup(response.text, "lxml")
    headlines = []
    dateTimes = []
    for headline in soup.findAll('img', attrs={'data-src' : re.compile('.*'), 'alt' : re.compile('.*')}):
        headlines.append((headline['alt']))
    for dt in soup.findAll('time'):
        dateTimes.append((dt.text.lstrip()))
    for dt, headline in zip(dateTimes, headlines):
        print(dt)
        print(headline + "\r\n")

def main():
    getBizJournals()


if __name__ == "__main__":
    main()
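A side note on the cookie handling above: `dict(cookies_are='...')` builds a single cookie named `cookies_are` whose value is the entire header string, not one entry per cookie. Whether that matters to the server is not established here. A minimal sketch of splitting a raw `Cookie:` header value into separate name/value pairs follows; the helper name `parse_cookie_header` and the shortened example string are hypothetical and used only for illustration.

```python
def parse_cookie_header(raw):
    """Split a raw 'Cookie:' header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only; cookie values may contain '=' themselves.
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Shortened, made-up header value for illustration only.
example = "privAu=0; _vis_opt_s=2%7C; token=abc=="
print(parse_cookie_header(example))
```

The resulting dict could then be passed as `cookies=` to `requests.get`, so that each cookie is sent under its own name.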

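For about 20 minutes this code returns results correctly, as many times as I run it, until the incap_ses_982_1008177 cookie times out; the site appears to sit behind the Incapsula WAF. After that I need to renegotiate the session to get a new cookie.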

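What I don't get is that the same GET keeps working from Burp Suite's Repeater tab, using the original incap_ses_982_1008177 cookie value. Even if I never update it, I can keep issuing the request and getting results. But if I try to use the same cookie from the script (after about 20 minutes), it returns nothing.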
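One difference between the two clients that may be worth ruling out: the script sends only cookies, while the captured request also carries browser headers such as User-Agent and Accept. Whether the WAF looks at those headers is an assumption, not something confirmed here. The sketch below builds (but does not send) a request with the captured headers, using only the standard library, so the outgoing headers can be inspected. `BURP_HEADERS`, `build_request`, and the placeholder cookie value are illustrative names, not part of the original script.

```python
import urllib.request

URL = "https://www.bizjournals.com/bizjournals/topic/mergers-and-acquisitions"

# Headers copied from the captured request. Accept-Encoding is left out on
# purpose: urllib does not decompress gzip/deflate bodies automatically.
BURP_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) "
                  "Gecko/20100101 Firefox/62.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Upgrade-Insecure-Requests": "1",
}


def build_request(url, cookie_header):
    """Return an unsent Request carrying the captured headers plus the cookies."""
    headers = dict(BURP_HEADERS)
    headers["Cookie"] = cookie_header  # raw header value, sent verbatim
    return urllib.request.Request(url, headers=headers)


# Placeholder cookie value; substitute the full string from the capture.
req = build_request(URL, "privAu=0")
for name, value in req.header_items():
    print(name + ": " + value)
```

Nothing is sent over the network here; passing `req` to `urllib.request.urlopen` would issue the request.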

So I thought "no worries", I'll add some logic to fetch a new cookie, and added the following code to parse it out:

# Python 2 code: cookielib and urllib2 were renamed in Python 3.
import cookielib
import urllib2

cookies = cookielib.LWPCookieJar()
handlers = [
    urllib2.HTTPHandler(),
    urllib2.HTTPSHandler(),
    urllib2.HTTPCookieProcessor(cookies)
    ]
opener = urllib2.build_opener(*handlers)

def fetch(uri):
    req = urllib2.Request(uri)
    return opener.open(req)

def dump():
    for cookie in cookies:
        print cookie.name, cookie.value

uri = 'https://www.bizjournals.com/bizjournals/topic/mergers-and-acquisitions'
res = fetch(uri)
dump()
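For readers on Python 3, where `cookielib` became `http.cookiejar` and `urllib2` became `urllib.request`, a rough equivalent of the snippet above might look like the following. This is a sketch, not a tested fix for the Incapsula behaviour; `make_opener` and `cookie_pairs` are illustrative names, and the network call is left commented out.

```python
import http.cookiejar
import urllib.request


def make_opener():
    """Create a cookie jar and an opener that stores response cookies in it."""
    jar = http.cookiejar.LWPCookieJar()
    # build_opener adds the default HTTP and HTTPS handlers on its own.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return jar, opener


def cookie_pairs(jar):
    """Return the (name, value) pairs currently held in the jar."""
    return [(cookie.name, cookie.value) for cookie in jar]


jar, opener = make_opener()
# Uncomment to perform the actual request and fill the jar:
# opener.open("https://www.bizjournals.com/bizjournals/topic/mergers-and-acquisitions")
print(cookie_pairs(jar))
```

Reusing the same `opener` for later requests sends back whatever cookies the jar has collected, which avoids copying cookie values by hand.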

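However, when I take the generated cookie and plug it into the script, it fails, yet in Burp it works! I've been banging my head against the wall for a few days. Does anyone know what causes this behavior, and whether there is a way to reliably scrape pages behind Incapsula?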

0 answers:

No answers yet.