Unable to download a cookie-protected file with Python

Date: 2013-06-11 14:05:53

Tags: python cookies download session-cookies urllib2

I have been trying to solve this all day. The page http://www.some.site/index.php asks for a username and password and sends a cookie. Fine, I got in like this:

import urllib, urllib2, cookielib, os
import re # not required here but tried it out though
import requests # not required here but tried it out though
username = 'somebody'
password = 'somepass'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
resp = opener.open('http://www.some.site/index.php', login_data)
print resp.read()

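The problem is that in the middle of the page there is a link to download an .xls file: http://www.some.site/excel_file.php?/t=1303457489. I can download the file in any browser (Mozilla, Chrome, IE), but not with Python. The post data after .php (i.e. ?t=1370919996) changes every time I log in or refresh the page.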

Maybe I am wrong, but I believe the post data is generated from the cookie (or session cookie). However, the cookie only contains this: ('set-cookie', 'PHPSESSID=9cde55534fcc8e136fcf6588c0d0f1df; path=/')
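One way to probe this guess is to look at both values directly. The sketch below is written for Python 3 rather than the Python 2 used in this post. It parses the Set-Cookie value quoted above and decodes the two t values from the question as Unix timestamps. Reading t as a timestamp is only an assumption; nothing in the post confirms how the server generates it, and as_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone
from http.cookies import SimpleCookie

# Set-Cookie value copied from the question.
cookie = SimpleCookie()
cookie.load('PHPSESSID=9cde55534fcc8e136fcf6588c0d0f1df; path=/')
cookie_names = sorted(cookie.keys())        # names of the cookies that were set
session_path = cookie['PHPSESSID']['path']  # the only attribute on the cookie


def as_utc(t):
    """Interpret t as seconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(t, tz=timezone.utc)


# The two t values that appear in the question.
first = as_utc(1303457489)
second = as_utc(1370919996)

print(cookie_names, session_path)
print(first.isoformat())
print(second.isoformat())
```

Under that assumption, 1370919996 decodes to 2013-06-11 03:06:36 UTC, the same date as this post, so t may simply be time-based rather than derived from PHPSESSID.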

This is one of the ways I tried to save the file:

print "downloading with urllib2"
f = urllib2.urlopen('http://www.some.site/excel_file.php')
data = f.read()
with open("exceldoc.xls", "wb") as code:
    code.write(data)

Whether I save the response or print it, I get the same bad-request error:

<b>Fatal error</b>:  Call to a member function FetchRow() on a non-object in <b>http://www.some.site/excel_file.php</b> on line <b>112</b><br 

How can I download this file with Python? Many thanks in advance for your help!

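There are many similar posts. I have checked them, and my example is inspired by them, but nothing has worked for me. I am not very familiar with cookies, PHP, or JS.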

Edit: this is what I get when I print the contents of index.php:

<html>
<head>
<title>SOMETITLE</title>
<meta http-equiv="Page-Enter" content="blendTrans(Duration=0.5)">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel='stylesheet' type='text/css' href='somesite.css'>
<SCRIPT LANGUAGE="JavaScript">
<!-- JavaScript hiding

function clearDefault(obj) {
    if (!obj._cleared) {
                obj.value='';
                obj._cleared=true;
    }
}

// -->
</SCRIPT>
</head>

<body bgcolor="#FFFFFF" text="#000000">

<table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
  <tr>
    <td>
      <table width="1000" height="150" border="0" align="center" cellpadding="16" cellspacing="0" class="header" style="background: #989896 url('images/header.png') no-repeat;">
        <tr>
          <td valign="middle">
            <table width="100%" border="0" align="center" cellpadding="0" cellspacing="0">
              <tr>
                <td width="380">&nbsp;</td>
                <td>
                  <div id="login">
                       <form name="flogin" method="post" action="/index.php">
                      <h1>Login</h1>
                      <input name="uName" type="text" value="Username:" class="name" onfocus="clearDefault(this)">
                      <br>
                      <input type="password" name="uPw"  value="Password:" class="pass" onfocus="clearDefault(this)">
                      <input type="submit" name="Submit" value="OK" class="submit">
                    </form>
                  </div>                                                                
                                                                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
                </td>
  </tr>
</table>

</body>
</html>
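The form in this output posts fields named uName and uPw, while the login code above sends username and j_password, so it may be worth checking which names the server actually expects. The following Python 3 sketch (the post itself uses Python 2) lists the form action and input names from a trimmed copy of the form above. FormFieldCollector and LOGIN_FORM are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser

# Login form copied from the index.php output above (trimmed to the form).
LOGIN_FORM = '''
<form name="flogin" method="post" action="/index.php">
  <input name="uName" type="text" value="Username:" class="name">
  <input type="password" name="uPw" value="Password:" class="pass">
  <input type="submit" name="Submit" value="OK" class="submit">
</form>
'''


class FormFieldCollector(HTMLParser):
    """Collect the action of each <form> and the name of each <input>."""

    def __init__(self):
        super().__init__()
        self.actions = []
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form':
            self.actions.append(attrs.get('action'))
        elif tag == 'input' and 'name' in attrs:
            self.fields.append(attrs['name'])


collector = FormFieldCollector()
collector.feed(LOGIN_FORM)
print(collector.actions)  # where the form is submitted
print(collector.fields)   # field names the server is likely to read
```

If the response to the login request still contains this form, that is a hint that the login did not succeed and the session is not authenticated.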

1 Answer:

Answer 0 (score: 1)

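You could try parsing the response from your first code section and then requesting the extracted URL with the same opener, so that the session cookie is sent along. I don't know the actual format of the link, so the pattern below is a guess: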

import urllib, urllib2, cookielib, os
import re # going to use this now!

username = 'somebody'
password = 'somepass'

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'j_password' : password})
resp = opener.open('http://www.some.site/index.php', login_data)
content = resp.read()
print content

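# Escape '.' and '?' so they match literally; the optional '/' covers the
# "?/t=" form that appears in the question's URL.
match = re.search(
    r"<a\s+href=\"(?P<file_link>http://www\.some\.site/excel_file\.php\?/?t=\d+)\">",
    content,
    re.IGNORECASE
)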

assert match is not None, "Couldn't find the file link..."

file_link = match.group('file_link')
print "downloading {} with urllib2".format(file_link)
f = opener.open(file_link)
data = f.read()
with open("exceldoc.xls", "wb") as code:
    code.write(data)
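For anyone on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the same idea can be sketched as follows. This is only a sketch under assumptions: the link shape (excel_file.php followed by ?t= or ?/t= and digits), the helper names find_file_link and download_file, and the login_fields argument are all hypothetical, and the field names passed in should match the input names of the real login form.

```python
import re
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = 'http://www.some.site/'  # placeholder host from the question

# Assumed link shape: excel_file.php?t=<digits>, with an optional "/" after "?".
LINK_RE = re.compile(
    r'href=["\'](?P<file_link>[^"\']*excel_file\.php\?/?t=\d+)["\']',
    re.IGNORECASE,
)


def find_file_link(html, base_url=BASE_URL):
    """Return the absolute download URL found in html, or None."""
    match = LINK_RE.search(html)
    if match is None:
        return None
    return urllib.parse.urljoin(base_url, match.group('file_link'))


def download_file(login_fields, target='exceldoc.xls'):
    """Log in, locate the link, and fetch it with the same cookie-aware opener."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    data = urllib.parse.urlencode(login_fields).encode('ascii')
    with opener.open(urllib.parse.urljoin(BASE_URL, 'index.php'), data) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    file_link = find_file_link(html)
    if file_link is None:
        raise RuntimeError("Couldn't find the file link")
    # Reusing the opener sends the PHPSESSID cookie with the download request.
    with opener.open(file_link) as resp, open(target, 'wb') as out:
        out.write(resp.read())
    return file_link
```

A call might look like download_file({'uName': 'somebody', 'uPw': 'somepass'}), assuming those are the field names the login form uses. Keeping find_file_link separate makes it possible to test the pattern against saved HTML without touching the network.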