Question

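I'm currently writing a program that helps users determine the best times to post on tumblr. As with Twitter, most followers have so many subscriptions that they cannot keep up, so it helps to know when one's own specific following is (mostly) online. On tumblr this can be determined in two ways: first, whether they have recently shared content that was recently posted, and second, whether they have recently added to their liked-posts list.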

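Frustratingly, even when set to "public", the liked-posts stream of an arbitrary user (other than oneself) is only available to logged-in entities. As far as I know, that means I either have to upload a login cookie to the application every so often, or get this POST request working.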

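I've looked at a number of successful outbound requests in Opera's inspector, but I must still be missing something, or perhaps requests is doing something that the server rejects no matter what I do.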

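The essence of the problem is as follows. This is currently written in Python 2.7 and uses Python requests and BeautifulSoup. To run it yourself, update the e and p pair at the top of get_login_response() to a set of real values.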

import requests
from bs4 import BeautifulSoup

class Login:

    def __init__(self):
        self.session = requests.session()

    def get_hidden_fields(self):
        """ -> string. tumblr dynamically generates a key for its login forms
        This should extract that key from the form so that the POST-data to
        login will be accepted.
        """
        pageRequest = requests.Request("GET","https://www.tumblr.com/login")
        received = self.session.send( pageRequest.prepare() )
        html = BeautifulSoup(received.content)
        hiddenFieldDict = {}
        hiddenFields = html.find_all("input",type="hidden")
        for x in hiddenFields: hiddenFieldDict[x["name"]]=x["value"]
        return hiddenFieldDict

    def get_login_response(self):
        e = u"dead@live.com"
        p = u"password"
        endpoint = u"https://tumblr.com/login"
        payload = { u"user[email]": e,
                    u"user[password]": p,
                    u"user[age]":u"",
                    u"tumblelog[name]": u"",
                    u"host": u"www.tumblr.com",
                    u"Connection:":u"keep-alive",
                    u"Context":u"login",
                    u"recaptcha_response_field":u""
                  }
        payload.update( self.get_hidden_fields() )
    ##        headers = {"Content-Type":"multipart/form-data"}
        headers = {u"Content-Type":u"application/x-www-form-urlencoded",
                   u"Connection:":u"keep-alive",
                   u"Origin":u"https://tumblr.com",
                   u"Referer": u"https://www.tumblr.com/login",
                   u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
                   u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                   u"Accept-Encoding":u"gzip,deflate,sdch",
                   u"Accept-Language":u"en-US,en;q=0.8",
                   u"Cache-Control":u"max-age=0"
                   #"Content-Length":VALUE is still needed
                   }
        # this cookie is stale but it seems we get these for free anyways,
        #  so I'm not sure whether it's actually needed. It's mostly
        #  google analytics info.
        sendCookie = {"tmgioct":"52c720e28536530580783210",
                      "__qca":"P0-1402443420-1388781796773",
                      "pfs":"POIPdNt2p1qmlMGRbZH5JXo5k",
                      "last_toast":"1388783309",
                      "capture":"GDTLiEN5hEbMxPzys1ye1Gf4MVM",
                      "logged_in":"0",
                      "_ga":"GA1.2.2064992906.1388781797",
                      "devicePixelRatio":"1",
                      "documentWidth":"1280",
                      "anon_id":"VNHOJWQXGTQXHNCFKYJQUMUIVQBRISPR",
                      "__utma":"189990958.2064992906.1388781797.1388781797.1388781797.1",
                      "__utmb":"189990958.28.10.1388781797",
                      "__utmc":"189990958",
                      "__utmz":"189990958.1388781797.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"}
        loginRequest = requests.Request("POST",
                                        endpoint,
                                        headers,
                                        data=payload,
                                        cookies=sendCookie # needed?
##                                        ,auth=(e,p) # may not be needed
                                        )

        contentLength = len(loginRequest.prepare().body)
        loginRequest.data.update({u"Content-Length":unicode(contentLength)})
        return self.session.send( loginRequest.prepare() )

l = Login()
res = l.get_login_response()
print "All cookies: ({})".format(len(l.session.cookies))
print l.session.cookies # has a single generic cookie from the initial GET query
print "Celebrate if non-empty:"
print res.cookies # this should theoretically contain the login cookie

My output:

All cookies: (1)
<<class 'requests.cookies.RequestsCookieJar'>[<Cookie tmgioct=52c773ed65cfa30622446430 for www.tumblr.com/>]>
Celebrate if non-empty:
<<class 'requests.cookies.RequestsCookieJar'>[]>

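Bonus points if my code is insecure and you have pointers on that as well. I chose the requests module for its simplicity, but if it lacks features and my goal is achievable with httplib2 or something else, I'm willing to switch.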

Answer 1

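There are a number of things you need to do that you aren't doing, and quite a few things you are doing that you don't need to.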

Firstly, go back and examine the POST fields being sent on your login request. When I do this in Chrome, I see the following:

user[email]:<redacted>
user[password]:<redacted>
tumblelog[name]:
user[age]:
recaptcha_public_key:6Lf4osISAAAAAJHn-CxSkM9YFNbirusAOEmxqMlZ
recaptcha_response_field:
context:other
version:STANDARD
follow:
http_referer:http://www.tumblr.com/logout
form_key:!1231388831237|jS7l2SHeUMogRjxRiCbaJNVduXU
seen_suggestion:0
used_suggestion:0

Your Requests-based POST is missing a few of those fields, specifically recaptcha_public_key, version, follow, http_referer, form_key, seen_suggestion and used_suggestion.

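These fields are not optional: they need to be sent on this POST. Some of them can safely be reused as-is, but the safest way to obtain them is to fetch the login page itself and use BeautifulSoup to pull the values out of the HTML. I'm going to assume you have the skills to do that (e.g. you know how to find form inputs in HTML and parse them to get their default values).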
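The extraction step can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the Python 3 standard library's html.parser rather than BeautifulSoup so that it runs without third-party packages, and SAMPLE_FORM, HiddenInputCollector and extract_hidden_fields are hypothetical names with made-up values, not tumblr's actual markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> tags."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        name = attributes.get("name")
        if input_type == "hidden" and name:
            # A missing or empty value attribute is stored as an empty string.
            self.fields[name] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of every hidden input found in the given HTML text."""
    collector = HiddenInputCollector()
    collector.feed(html)
    collector.close()
    return collector.fields


# Made-up markup standing in for a login page; the real page will differ.
SAMPLE_FORM = """
<form method="post" action="/login">
  <input type="text" name="user[email]">
  <input type="hidden" name="form_key" value="example-key">
  <input type="hidden" name="version" value="STANDARD">
  <input type="hidden" name="follow" value="">
</form>
"""

print(extract_hidden_fields(SAMPLE_FORM))
```

The resulting dict can be merged into the POST payload, so values such as form_key always come from the page that was just fetched rather than from a hard-coded copy.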

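A good habit to get into here is to use a tool like Wireshark or tcpdump to examine the HTTP traffic your requests generate, and to compare it with what Chrome/Opera sends. This lets you see what is and isn't being sent, and how the two requests differ.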

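Secondly, once you start by hitting the login page you won't need to send cookies on your POST, so you can stop doing that. More generally, when using a requests Session object, you should avoid injecting any additional cookies: just emulate the flow of HTTP requests from an actual browser and your cookie state will be fine.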

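Thirdly, you're massively over-specifying your headers dictionary. Most of the fields you're providing will be populated automatically by Requests. Now, given that you're trying to emulate a browser (Opera, by the looks of things), you will want to override a few of them, but most can be left alone. You should be using this header dictionary:

{
    u"Origin":u"https://tumblr.com",
    u"Referer": u"https://www.tumblr.com/login",
    u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
    u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    u"Accept-Language":u"en-US,en;q=0.8",
}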

Below is a list of the fields I removed from your header dictionary, and why I removed them:

Content-Type: When you provide a dictionary to the data argument in Requests, we set the Content-Type to application/x-www-form-urlencoded for you. There's no need to do it yourself.
Connection: Requests manages HTTP connection pooling and keep-alives itself: don't get involved in the process, it will just go wrong.
Accept-Encoding: Again, please let Requests set this unless you are actually prepared to deal with decoding the content. Requests only knows how to handle gzip and deflate: if you send sdch and actually get it back, you will have to decode it yourself. Best not to advertise that you support it.
Cache-Control: POST requests cannot be cached, so this is irrelevant.
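The pruning described in this list can be sketched as a small helper. This is an illustration rather than part of Requests: strip_managed_headers and MANAGED_BY_REQUESTS are hypothetical names, the sample values are shortened, and the colon handling is only there because the question's dictionary contains the key "Connection:" with a stray colon.

```python
# Header names the list above suggests leaving to Requests (lower-cased).
MANAGED_BY_REQUESTS = {
    "content-type",
    "connection",
    "accept-encoding",
    "cache-control",
}


def strip_managed_headers(headers):
    """Return a copy of headers without the fields Requests should manage."""
    cleaned = {}
    for name, value in headers.items():
        # Compare case-insensitively and tolerate a trailing colon in the key.
        normalised = name.strip().rstrip(":").lower()
        if normalised not in MANAGED_BY_REQUESTS:
            cleaned[name] = value
    return cleaned


# Shortened version of the question's header dictionary.
original_headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Connection:": "keep-alive",
    "Origin": "https://tumblr.com",
    "Referer": "https://www.tumblr.com/login",
    "User-Agent": "Mozilla/5.0 (shortened for the example)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Encoding": "gzip,deflate,sdch",
    "Accept-Language": "en-US,en;q=0.8",
    "Cache-Control": "max-age=0",
}

print(sorted(strip_managed_headers(original_headers)))
```

In practice it is simpler to write the short dictionary by hand; the helper just makes explicit which keys are dropped.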

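Fourthly, and I want to be very clear here, do not calculate Content-Length yourself. Requests will do it for you and will get it right. If you send that header yourself, all kinds of weird bugs can come up that the Requests core development team then has to chase. There is never a good reason to set that header yourself. With this in mind, you can stop using PreparedRequest objects and go back to making the calls directly on the session (for example, session.post()).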
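Putting the four points together, the login flow might look something like the sketch below. It is written under assumptions: log_in, BROWSER_HEADERS, RecordingSession and the extract_fields callback are hypothetical names, the header dictionary is abbreviated, and the session is passed in so that the example can run offline against a stand-in object. Whether tumblr accepts the resulting POST is not something this sketch verifies.

```python
from types import SimpleNamespace

LOGIN_URL = "https://www.tumblr.com/login"

# Abbreviated browser-like headers; everything else is left to the session.
BROWSER_HEADERS = {
    "Origin": "https://tumblr.com",
    "Referer": "https://www.tumblr.com/login",
    "Accept-Language": "en-US,en;q=0.8",
}


def log_in(session, email, password, extract_fields):
    """GET the login page, then POST the form back through the same session."""
    # The GET lets the session collect whatever cookies the server sets.
    login_page = session.get(LOGIN_URL, headers=BROWSER_HEADERS)
    # extract_fields is a caller-supplied function: HTML text -> dict of fields.
    payload = dict(extract_fields(login_page.text))
    payload["user[email]"] = email
    payload["user[password]"] = password
    # No cookies=, no Content-Length, no PreparedRequest: the session handles them.
    return session.post(LOGIN_URL, data=payload, headers=BROWSER_HEADERS)


class RecordingSession:
    """Offline stand-in for a session object, used only to show the call order."""

    def __init__(self):
        self.calls = []

    def get(self, url, **kwargs):
        self.calls.append(("GET", url, kwargs))
        return SimpleNamespace(text="<html></html>")

    def post(self, url, **kwargs):
        self.calls.append(("POST", url, kwargs))
        return SimpleNamespace(status_code=200)


demo = RecordingSession()
log_in(demo, "user@example.com", "secret", lambda html: {"form_key": "example-key"})
print([method for method, _, _ in demo.calls])
```

With Requests installed, passing requests.Session() as the session argument would send real traffic through the same two calls, and the session's cookie jar would carry state from the GET to the POST.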

What is Python Requests doing wrong here, or what is my POST request missing?
