Page returned by a Python POST differs from the page the browser gets

Date: 2014-07-24 14:47:47

Tags: python http post cookies header

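I am trying to programmatically submit a list of genes to the well-known website DAVID (http://david.abcc.ncifcrf.gov/summary.jsp) for functional annotation. There are two other ways to do this, the API service (http://david.abcc.ncifcrf.gov/content.jsp?file=DAVID_API.html) and the web service (http://david.abcc.ncifcrf.gov/content.jsp?file=WS.html), but the former has stricter query limits and the latter does not accept my ID type. The only option left seems to be a program that posts the form, parses the resulting page, and extracts the download link. After monitoring the traffic with the Firefox add-on 'httpFox', I tried the following script: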

import requests as rq
import time

_n = 1
url0 = 'http://david.abcc.ncifcrf.gov'
url = 'http://david.abcc.ncifcrf.gov/summary.jsp'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:30.0) Gecko/20100101 Firefox/30.0'

def get_cookie(session_id): # prepare 'Cookie' in the headers for the post
    domain_hash = '260267544' # according to what's been sent by firefox 
    random_uid = '1113731634' # according to what's been sent by firefox
    global _t0
    init_time = _t0
    global _t 
    prev_time = _t
    _t = int(time.time())
    curr_time = _t
    global _n
    _n += 1
    session_count = _n
    campaign_count = 1
    utma = '.'.join(str(x) for x in (domain_hash, random_uid, init_time, prev_time, curr_time, session_count))
    utmz = '.'.join(str(x) for x in (domain_hash, init_time, session_count, campaign_count, 'utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'))
    cookie = '; '.join(str(x) for x in ('__utma=' + utma, '__utmz=' + utmz, 'JSESSIONID=' + session_id)) 
    return cookie

# first get the session ID
_t = int(time.time())
_t0 = _t
headers = {'User-Agent' : user_agent}
r = rq.get(url, headers = headers) 
session_id = r.cookies['JSESSIONID']
cookie = get_cookie(session_id)

# get the gene list
with open('list.txt', 'r') as fh:
    gene = [line.rstrip('\n') for line in fh]

# then post the form
headers = {  # all below is according to what's been sent by firefox
           'Host' : 'david.abcc.ncifcrf.gov',
           'User-Agent' : user_agent, 
           'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
           'Accept-Language' : 'en-US,en;q=0.5', 
           'Accept-Encoding' : 'gzip, deflate',
           'Referer' : url,
           'Cookie': cookie, 
           'Connection' : 'keep-alive', 
#           'Content-Type' : 'multipart/form-data; boundary=---------------------------17914945481928137296675300642',
#           'Content-Length' : '3581'
           }

data = {  # all below is according to what's been sent by firefox
        'idType' : 'OFFICIAL_GENE_SYMBOL',
        'uploadType' : 'list', 
        'multiList' : 'false', 
        'Mode' : 'paste', 
        'useIndex' : 'null',
        'usePopIndex' : 'null', 
        'demoIndex' : 'null', 
        'ids' : '\n'.join(gene), 
        'removeIndex' : 'null', 
        'renameIndex' : 'null', 
        'renamePopIndex' : 'null', 
        'newName' : 'null', 
        'combineIndex' : 'null', 
        'selectedSpecies' : 'null', 
        'SESSIONID' : session_id[-12:], # according to the pattern that the last 12 characters of 'JSESSIONID' is sent by firefox
        'uploadHTML' : 'null', 
        'managerHTML' : 'null', 
        'sublist' : '',
        'rowids' : '',
        'convertedListName' : 'null', 
        'convertedPopName' : 'null', 
        'pasteBox' : '\n'.join(gene), 
        'fileBrowser' : '', 
        'Identifier' : 'OFFICIAL_GENE_SYMBOL', 
        'rbUploadType' : 'list'}

r = rq.post(url = url, data = data, headers = headers)
if r.status_code == 200:
    with open("python.html", 'w') as fh:
        fh.write(r.text)

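However, the page my code gets back is 272 KB and is completely different from the 428 KB page that httpFox shows for the browser. I compared the headers and form fields sent by my script with those sent by Firefox, and the only differences seem to be:

  1. The cookie fields __utma and __utmz. They relate to Google Analytics, so they are probably not important.
  2. The 'Content-Type' and 'Content-Length' fields in the second set of headers, which I commented out. Following the suggestion at http://david.abcc.ncifcrf.gov/forum/viewtopic.php?f=14&t=885, it seems unnecessary to specify them by hand. However, the script does not work with them commented out either.

That is the basic situation. I would appreciate any help in pinpointing where the problem lies. I have also seen other suggestions, such as trying the browser emulator 'mechanize', but I am more curious about the cause: is something wrong with my program, and if so, how can it be fixed? Or are these modules simply not up to the task? Many thanks.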
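
On point 2, one thing that may be worth checking is how requests encodes the body. The commented-out header in the script shows that Firefox sent multipart/form-data, whereas requests encodes a dict passed as data= as application/x-www-form-urlencoded. The sketch below builds both variants without sending anything and prints the headers requests would generate. The two-field form is a shortened stand-in for the real one, and whether DAVID actually treats the two encodings differently is an assumption, not something confirmed here. It is written for Python 3 with a current requests release.

```python
import requests

URL = "http://david.abcc.ncifcrf.gov/summary.jsp"

# Shortened stand-in for the full form used in the script.
form = {"idType": "OFFICIAL_GENE_SYMBOL", "pasteBox": "Apba3\nDexi"}

# data= encodes the body as application/x-www-form-urlencoded.
urlencoded = requests.Request("POST", URL, data=form).prepare()

# files= with (None, value) tuples sends the same fields as
# multipart/form-data, without attaching a filename to any part.
multipart = requests.Request(
    "POST", URL, files={name: (None, value) for name, value in form.items()}
).prepare()

# Nothing has been sent; these are the headers requests would use.
print(urlencoded.headers["Content-Type"])
print(multipart.headers["Content-Type"])
print(multipart.headers["Content-Length"])
```

Because requests generates the boundary and the length itself, leaving 'Content-Type' and 'Content-Length' out of the hand-written headers is consistent with this approach.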

The list I am posting is:

    Apba3
    Apoa1bp
    Dexi
    Dhps
    Dnpep
    Eral1
    Gcsh
    Git1
    Grtp1
    Guk1
    Ifrd2
    Lsm3
    Map2k1ip1
    Med31
    Mettl11a
    Mrpl2
    mrpl24
    Mrpl30
    Mrpl46
    Ndufaf3
    Nr1h2
    Obfc2b
    Parp3
    Pigt
    Pop5
    Ppt2
    Ptpmt1
    RGD1304567
    RGD1306215
    RGD1309708
    Rras
    

My procedure for posting from the browser is:

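1. Open http://david.abcc.ncifcrf.gov/summary.jsp in Firefox.
2. In the left panel, which is shown by default, paste the gene list above into the box under "Step 1: Enter Gene List, A: Paste a list".
3. Click the drop-down under "Step 2: Select Identifier" and choose "OFFICIAL_GENE_SYMBOL".
4. Select the "Gene List" radio button under "Step 3: List Type".
5. Click "Submit List" under "Step 4: Submit List".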

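The browser then returns a new page with a pop-up window prompting the user to select the species and background. This is the response httpFox traced for this POST, and it is what I am trying to capture with my script.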
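
A possible variation on the script, sketched under assumptions: let a requests.Session carry JSESSIONID between the GET and the POST instead of assembling the Cookie header by hand, and send the fields as multipart/form-data, which is what the commented-out Content-Type header suggests Firefox did. The helper names as_multipart and submit are made up for this illustration, fields stands for the data dict from the script without its SESSIONID entry, and nothing here has been verified against DAVID, so the response may still differ from the browser's.

```python
import requests

URL = "http://david.abcc.ncifcrf.gov/summary.jsp"


def as_multipart(fields):
    """Turn a plain dict into the files= form that requests sends as multipart/form-data."""
    return {name: (None, str(value)) for name, value in fields.items()}


def submit(fields, timeout=30):
    """GET the form page, then POST the fields within the same session."""
    with requests.Session() as session:
        first = session.get(URL, timeout=timeout)
        first.raise_for_status()
        # The session stores JSESSIONID and sends it back on the POST.
        session_id = session.cookies.get("JSESSIONID", "")
        # Same pattern as the original script: last 12 characters of JSESSIONID.
        fields = dict(fields, SESSIONID=session_id[-12:])
        reply = session.post(
            URL,
            files=as_multipart(fields),
            headers={"Referer": URL},
            timeout=timeout,
        )
        reply.raise_for_status()
        return reply.text
```

Calling submit(data) with the form dict from the script would return the response body as text, which can then be written to a file and compared with the page saved from the browser.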

1 Answer:

Answer 0: (score: 1)

Use Selenium to drive a real browser:

from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

driver = webdriver.Firefox()
driver.get('http://david.abcc.ncifcrf.gov/summary.jsp')
sleep(0.1)
query = """Apba3
Apoa1bp
Dexi
Dhps
Dnpep
Eral1
Gcsh
Git1
Grtp1
Guk1
Ifrd2
Lsm3
Map2k1ip1
Med31
Mettl11a
Mrpl2
mrpl24
Mrpl30
Mrpl46
Ndufaf3
Nr1h2
Obfc2b
Parp3
Pigt
Pop5
Ppt2
Ptpmt1
RGD1304567
RGD1306215
RGD1309708
Rras"""
# paste the gene list into the list box
listBox = driver.find_element(By.ID, "LISTBox")
listBox.send_keys(query)

# typing "O" into the identifier drop-down is meant to pick OFFICIAL_GENE_SYMBOL
IDT = driver.find_element(By.ID, "IDT")
IDT.send_keys("O")

# find_element returns the first radio button named rbUploadType
radioCheck = driver.find_element(By.NAME, "rbUploadType")
radioCheck.click()


submitButton = driver.find_element(By.NAME, "B52")

submitButton.click()
sleep(0.1)
# accept the pop-up that appears after submitting
alert = driver.switch_to.alert
alert.accept()
sleep(0.1)
html = driver.page_source

The variable "html" then holds the page source.
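
With the page source in hand, the remaining step from the question is pulling out the download link. The following is a minimal sketch using only the Python 3 standard library. LinkCollector and extract_links are illustrative names, and the sample markup is invented, since the real structure of the DAVID result page is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Relative links are resolved against the page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return all absolute link targets found in the given HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


# Invented snippet standing in for driver.page_source.
sample = '<p><a href="data/download/chart.txt">Download File</a> <a name="top">x</a></p>'
print(extract_links(sample, "http://david.abcc.ncifcrf.gov/summary.jsp"))
```

With Selenium, the call would be extract_links(html, driver.current_url), followed by filtering for whichever link text or path the result page actually uses.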