Getting bad captcha images when scraping Google ReCaptcha

Date: 2014-07-23 08:02:09

Tags: javascript vb.net cookies recaptcha

I'm trying to load captchas. Until now I rendered them in a WebBrowser control, then copied and pasted the image and rendered it into a PictureBox.

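Why not download the image straight into the PictureBox instead? That reduces CPU and memory usage. This approach already works for another, more advanced captcha service called Solve Media. (With Solve Media, once you have viewed the image, the next time you try to view the image URL it serves you a fake, bad captcha image.)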

But now I need to support the ReCaptcha system, so that my bot runs faster than it does by simply refreshing the web page and waiting for it to render.

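So I'm posting my code here. As far as I can tell, I'm only missing one attribute of the HTTP request I'm simulating. I spoof the User-Agent as a real Internet Explorer 8. I think the problem is cookies: a cookie seems to get generated somehow and I can't figure out where, although I suspect I would also get a cookie by downloading the JavaScript file.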

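Either way, Google ReCaptcha tries to trick you with a fake captcha that you cannot read, rubbing it in your face that you are not doing things right. I know that when you see 2 black circles, it is obviously a fake one.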

Here are examples of a bad captcha and a good captcha:

[Images: bad captcha; good captcha]

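I remember that ReCaptcha once had another security feature that checked whether you loaded the captcha image from the actual domain it was placed on. I have no idea how it could tell, since I download everything locally. They seem to have removed this feature. (Actually it still exists on some sites. It seems to be disabled by default, and it is easy to fool with the Referer header.)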

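I'm not trying to cheat anything here. I will still type these captchas in manually, but I want to type them in faster than the time it normally takes to render the page.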

I'd like the captchas to be the street-number ones, or at least 2 words without the black circles.

Anyway, here is my current code.

Dim newCaptcha = New Captcha
Dim myUserAgent As String = ""
Dim myReferer As String = "http://www.google.com/recaptcha/demo/"
Dim outputSite As String = HTTP.HTTPGET("http://www.google.com/recaptcha/demo/", "", "", "", myUserAgent, myReferer)
Dim recaptchaChallengeKey = GetBetween(outputSite, "http://www.google.com/recaptcha/api/challenge?k=", """")

'Google ReCaptcha Captcha
outputSite = HTTP.HTTPGET("http://www.google.com/recaptcha/api/challenge?k=" & recaptchaChallengeKey, "", "", "", myUserAgent, myReferer)

'outputSite = outputSite.Replace("var RecaptchaState = {", "{""RecaptchaState"": {")
'outputSite = outputSite.Replace("};", "}}")
'Dim jsonDictionary As Dictionary(Of String, Object) = New JavaScriptSerializer().Deserialize(Of Dictionary(Of String, Object))(outputSite)
Dim recaptchaChallenge = GetBetween(outputSite, "challenge : '", "',")
'This page looks useless, but the JavaScript seems to load it anyway; maybe this is why I get bad captchas?
outputSite = HTTP.HTTPGET("http://www.google.com/recaptcha/api/js/recaptcha.js", "", "", "", myUserAgent, myReferer)

If HTTP.LoadWebImageToPictureBox(newCaptcha.picCaptcha, "http://www.google.com/recaptcha/api/image?c=" & recaptchaChallenge, myUserAgent, myReferer) = False Then
    MessageBox.Show("Recaptcha Image loading failed!")
Else
    Dim newWork As New Work
    newWork.CaptchaForm = newCaptcha
    newWork.AccountId = 1234 'ID of Accounts.
    newWork.CaptchaHash = "recaptcha_challenge_field=" & recaptchaChallenge
    newWork.CaptchaType = "ReCaptcha"
    Works.Add(newWork)
    newCaptcha.Show()
End If
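
`GetBetween` is not shown in the question. Below is a minimal sketch of what such a helper presumably does, written in Python to match the answer further down: it returns the text between the first occurrence of a start marker and the next occurrence of an end marker. The name `get_between` and the sample text are assumptions for illustration only; the sample is made up and is not real server output.

```python
def get_between(text, start, end):
    """Return the substring between the first `start` and the next `end`.

    Returns an empty string if either marker is missing.
    """
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return ""
    return text[begin:stop]


# Made-up sample in the shape the question's code expects (not real output).
sample = (
    "var RecaptchaState = {\n"
    "    challenge : 'abc123',\n"
    "    server : 'http://example.invalid/',\n"
    "};"
)

challenge = get_between(sample, "challenge : '", "',")
server = get_between(sample, "server : '", "',")
```

Returning an empty string on a missing marker mirrors how the question's code treats a failed request (an empty response), but it also hides parse failures, so a caller may prefer to check for the empty result explicitly.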

Here is the HTTP class I use.

Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Net
Imports System.IO
Imports System.Drawing
Imports System.Windows.Forms
Public Class HTTP

    Public StoredCookies As New CookieContainer

    Public Function HTTPGET(ByVal url As String, ByVal proxyname As String, ByVal proxylogin As String, ByVal proxypassword As String, ByVal userAgent As String, ByVal referer As String) As String
        Dim resp As HttpWebResponse
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)

        If userAgent = "" Then
            userAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
        End If
        req.UserAgent = userAgent
        req.Referer = referer
        req.AllowAutoRedirect = True
        req.ReadWriteTimeout = 5000
        req.CookieContainer = StoredCookies
        req.Headers.Set("Accept-Language", "en-us")

        req.KeepAlive = True
        req.Method = "GET"

        Dim stream_in As StreamReader

        If proxyname <> "" Then
            Dim proxyIP As String = proxyname.Split(New Char() {":"})(0)
            Dim proxyPORT As Integer = CInt(proxyname.Split(New Char() {":"})(1))

            Dim proxy As New WebProxy(proxyIP, proxyPORT)
            'if proxylogin is an empty string then don't use proxy credentials (open proxy)
            If proxylogin <> "" Then
                proxy.Credentials = New NetworkCredential(proxylogin, proxypassword)
            End If
            req.Proxy = proxy
        End If

        Dim response As String = ""
        Try
            resp = DirectCast(req.GetResponse(), HttpWebResponse)
            StoredCookies.Add(resp.Cookies)
            stream_in = New StreamReader(resp.GetResponseStream())
            response = stream_in.ReadToEnd()
            stream_in.Close()
        Catch ex As Exception
            'Errors are swallowed here, so a failed request returns an empty string.
        End Try
        Return response
    End Function


    Public Function LoadWebImageToPictureBox(ByVal pb As PictureBox, ByVal ImageURL As String, ByVal userAgent As String, ByVal referer As String) As Boolean
        Dim bAns As Boolean

        Try
            Dim resp As WebResponse
            Dim req As HttpWebRequest

            Dim sURL As String = Trim(ImageURL)

            If Not sURL.ToLower().StartsWith("http://") Then sURL = "http://" & sURL

            req = DirectCast(WebRequest.Create(sURL), HttpWebRequest)

            If userAgent = "" Then
                userAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
            End If
            req.UserAgent = userAgent
            req.Referer = referer
            req.AllowAutoRedirect = True
            req.ReadWriteTimeout = 5000
            req.CookieContainer = StoredCookies
            req.Headers.Set("Accept-Language", "en-us")

            req.KeepAlive = True
            req.Method = "GET"

            resp = req.GetResponse()
            If Not resp Is Nothing Then
                Dim remoteStream As Stream = resp.GetResponseStream()
                Dim objImage As New MemoryStream
                Dim bytesProcessed As Integer = 0
                Dim myBuffer As Byte()
                ReDim myBuffer(1024)
                Dim bytesRead As Integer
                bytesRead = remoteStream.Read(myBuffer, 0, 1024)
                Do While (bytesRead > 0)
                    objImage.Write(myBuffer, 0, bytesRead)
                    bytesProcessed += bytesRead
                    bytesRead = remoteStream.Read(myBuffer, 0, 1024)
                Loop
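                'Rewind before decoding, and copy into a Bitmap so the image does not depend on the stream staying open.
                objImage.Position = 0
                Using decoded As Image = Image.FromStream(objImage)
                    pb.Image = New Bitmap(decoded)
                End Using
                bAns = True
                objImage.Close()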
            End If
        Catch ex As Exception
            bAns = False
        End Try

        Return bAns
    End Function
End Class
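
The proxy handling in `HTTPGET` splits `proxyname` on `:` and assumes a numeric port is always present. As a hedged illustration of the same step with explicit validation, here is a small Python sketch. The function name `parse_proxy` is hypothetical and not part of the question's code.

```python
def parse_proxy(proxyname):
    """Split a "host:port" string into a (host, port) tuple.

    Raises ValueError if the host or port is missing, or if the port
    is not an integer in the range 1..65535.
    """
    host, sep, port_text = proxyname.rpartition(":")
    if not sep or not host:
        raise ValueError("expected host:port, got %r" % proxyname)
    port = int(port_text)  # raises ValueError for a non-numeric or empty port
    if not 1 <= port <= 65535:
        raise ValueError("port out of range: %d" % port)
    return host, port


host, port = parse_proxy("127.0.0.1:8080")
```

Failing loudly on a malformed proxy string makes a configuration mistake visible immediately, instead of surfacing later as an unexplained empty response.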

Edit: I found the problem. It is this obfuscated client-side JavaScript encryption system from Google:

http://www.google.com/js/th/1lOyLe_nzkTfeM2GpTkE65M1Lr8y0MC8hybXoEd-x1s.js

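I'd still like to beat it without using the heavyweight WebBrowser control. Maybe there is some lightweight, fast JavaScript evaluation control? I'm reluctant to port it to VB.NET, because as soon as I do, they may change a few variables or the encryption altogether, and all my work would be for nothing. So I want something smarter. At this point I don't even know how the URL is generated. It appears to be static for now, and it may be a real file rather than one generated on the fly.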

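It turns out that the challenge the challenge page returns for the image is only a decoy. That challenge is then replaced (encrypted, maybe?) on the client side using the variables t1, t2 and t3. It seems this encryption is not used every time, and once you get past it you get the result I'm after. My code almost works, but it stops working at very random intervals, and I want something more solid that I can leave unattended for weeks.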

1 Answer:

Answer 0: (score: 9)

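I had the same problem and found a solution. It does not give you the easiest captchas, but it at least gives you easier images. The result will be one readable word and one blurry one.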

I found that downloading "recaptcha/api/reload" is very important for achieving this. Adding the "cachestop" parameter may help as well, and perhaps the referer does too.

# UrlMgr, textextract, solveCaptcha, reloadParams, id and referer are defined elsewhere and not shown here.
data = UrlMgr("http://www.google.com/recaptcha/api/challenge?k=%s&cachestop=%.17f" % (id, random.random()), referer=referer, nocache=True).data
challenge = re.search("challenge : '(.*?)',", data).group(1)
server = re.search("server : '(.*?)',", data).group(1)
# This step is essential for getting readable captchas. Normally we could take the "c" value from above
# and retrieve a captcha right away, but that one would be barely readable.
reloadParams["c"] = challenge
reloadParams["k"] = id
reloadParams["lang"] = "de"
reloadParams["reason"] = "i"
reloadParams["type"] = "image"
data = UrlMgr("http://www.google.com/recaptcha/api/reload", params=reloadParams, referer=referer, nocache=True).data
challenge = textextract(data, "Recaptcha.finish_reload('", "',")
return challenge, solveCaptcha(UrlMgr("%simage" % (server), params={"c":challenge}, referer=referer))
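
The `cachestop` value in the first request above is just a random float formatted to 17 decimal places. The sketch below isolates that step using only the standard library. The function name `challenge_url` and the key `EXAMPLE_KEY` are made up for illustration. Judging by its name, the parameter is probably a cache buster, but that is an assumption.

```python
import random
import re
from urllib.parse import parse_qs, urlencode, urlsplit

CHALLENGE_URL = "http://www.google.com/recaptcha/api/challenge"


def challenge_url(site_key):
    """Build the challenge URL with a fresh random `cachestop` value."""
    # Same formatting as the snippet above: a float in [0, 1) with 17 decimals.
    cachestop = "%.17f" % random.random()
    return CHALLENGE_URL + "?" + urlencode({"k": site_key, "cachestop": cachestop})


url = challenge_url("EXAMPLE_KEY")
query = parse_qs(urlsplit(url).query)
```

Using `urlencode` instead of string interpolation also escapes the key properly if it ever contains characters that are not URL-safe.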

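As for further improvement, my guess is that the "th" parameter is used to detect bots. It is generated by some complex JavaScript, which I have not debugged myself.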