I am trying to scrape artist URLs from this page:
https://myspace.com/discover/artists?genreId=1002532
However, the page makes an AJAX call to fetch the user details. I can see this URL in Firebug:
https://myspace.com/ajax/artistspage?chartType=heavyrotation&genreId=1002532&page=0
If I open this URL in a separate tab, nothing is displayed, but if I look at the Response tab in Firebug, all the details are there.
How can I get all of the content?
Answer 0 (score: 1)
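When you open https://myspace.com/ajax/artistspage?chartType=heavyrotation&genreId=1002532&page=0 manually in the browser and look at the request in Firebug, you will see that it receives a 401 Unauthorized response. That is because, when the data is requested from the official Myspace page https://myspace.com/discover/artists?genreId=1002532, the request headers are set in a particular way that makes the request valid. Those headers are absent when your browser requests the URL directly.
Here are the headers that work: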
Accept:*/*
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Client:persistentId=53065c06-c877-47c5-933a-4b22d7f28cd9&screenWidth=1440&screenHeight=900&timeZoneOffsetHours=7&visitId=31c9d922-9984-4ac5-9bb0-0bb253bc89c3&windowWidth=1043&windowHeight=407
Connection:keep-alive
Cookie:persistent_id=pid%3D53065c06-c877-47c5-933a-4b22d7f28cd9%26llid%3D%26lprid%3D%26lltime%3D; beacons_enabled=true; __utmt=1; ads=adInitVisit%3D1432446031357; player=sequenceId%3D-1%26paused%3Dtrue%26currentTime%3D0%26volume%3D0.5%26mute%3Dfalse%26shuffled%3Dfalse%26repeat%3Doff%26mode%3Dqueue%26radioEntity%3D%26radioMediaType%3D%26radioMediaId%3D%26radioCurrentTime%3D0%26pinned%3Dfalse%26streamStartDateTime%3D%26radioStreamStartDateTime%3D%26at%3D360%26incognito%3Dfalse%26allowSkips%3Dtrue%26ccOn%3Dfalse; visit_id=31c9d922-9984-4ac5-9bb0-0bb253bc89c3; __utma=102911388.1051160901.1432446029.1432446029.1432446029.1; __utmb=102911388.2.10.1432446029; __utmc=102911388; __utmz=102911388.1432446029.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
DNT:1
Hash:NjI2YWM0YzM0YmJiZTg1NsKqwpMGw4HCuAvClMOGwoxAXMOXw50Qw5PCnH7DqVQIAygsY25wwrfCtsOcd8KuwqnCiMKSwobCrMKswpvDhEIrDcKYM0rCocKbJcKYEsKWw53Dr8KIwq7CgMKWw5XCo8KBGHVvURQKwpzDrMO9w5fDlsKzNhDChMOtw7wgw7NuDsK0wq1oC1sOOXAzK8KuwqdyEUDDnRk+w6BPwrIhfsKtw7Fewrcpa8Okw4c%3D
Host:myspace.com
Pragma:no-cache
Referer:https://myspace.com/discover/artists?genreId=1002532
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2409.0 Safari/537.36
X-Requested-With:XMLHttpRequest
Here are the headers that do not work:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Connection:keep-alive
Cookie:persistent_id=pid%3D53065c06-c877-47c5-933a-4b22d7f28cd9%26llid%3D%26lprid%3D%26lltime%3D; beacons_enabled=true; __utmt=1; ads=adInitVisit%3D1432446031357; __utma=102911388.1051160901.1432446029.1432446029.1432446029.1; __utmb=102911388.2.10.1432446029; __utmc=102911388; __utmz=102911388.1432446029.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); player=sequenceId=-1&paused=true&currentTime=0&volume=0.5&mute=false&shuffled=false&repeat=off&mode=queue&radioCurrentTime=0&pinned=false&at=360&incognito=false&allowSkips=true&ccOn=false; visit_id=31c9d922-9984-4ac5-9bb0-0bb253bc89c3
DNT:1
Host:myspace.com
Pragma:no-cache
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2409.0 Safari/537.36
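You will notice several differences. Most importantly, the working request includes the Hash and Referer headers. I assume that at least Hash must be present for the server to validate the request. You will have to work out how this Hash is generated on the Myspace page, and probably also set the Referer header so that the request appears to come from the correct page.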
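The comparison between the two header listings above can also be done mechanically. The following is a minimal sketch in Python that uses only the header names copied from those listings (values are left out for brevity); the function name missing_headers is hypothetical and chosen for illustration.

```python
# Header names copied from the two listings above (values omitted for brevity).
WORKING = [
    "Accept", "Accept-Encoding", "Accept-Language", "Cache-Control", "Client",
    "Connection", "Cookie", "DNT", "Hash", "Host", "Pragma", "Referer",
    "User-Agent", "X-Requested-With",
]
FAILING = [
    "Accept", "Accept-Encoding", "Accept-Language", "Cache-Control",
    "Connection", "Cookie", "DNT", "Host", "Pragma", "User-Agent",
]


def missing_headers(working, failing):
    """Return header names present in `working` but absent from `failing`.

    HTTP header names are case-insensitive, so they are compared lowercased.
    The order of `working` is preserved in the result.
    """
    failing_lower = {name.lower() for name in failing}
    return [name for name in working if name.lower() not in failing_lower]


if __name__ == "__main__":
    # Prints: ['Client', 'Hash', 'Referer', 'X-Requested-With']
    print(missing_headers(WORKING, FAILING))
```

This only compares names. Headers that appear in both listings, such as Accept and Cookie, also differ in their values, so they may be worth checking by hand as well.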
If you dig into the JS on the page, you will find this snippet in https://x.myspacecdn.com/new/common/js/global.7A07230F0926F7451E2F85D8F2C647D0.min.js:
a.setRequestHeader("Hash",context.hashMashter)
This is where the Hash header is set from context.hashMashter. If you go to https://x.myspacecdn.com/new/common/js/authentication.68B094D880713CC3A9EB77F984FC09F4.min.js, you can see that it is assigned by this snippet:
context.hashMashter=a.hashMashter
I don't know exactly what a is, but if you want to keep exploring, I think this is a good place to start.
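Once a Hash value has been obtained somehow, the request could be assembled along these lines. This is only a sketch using Python's standard library: the function name build_artists_request and the choice of which headers to send are assumptions, and how to obtain the hash value is not covered here. The server may well require further headers from the working listing (for example Cookie or Client), so treat this as a starting point rather than a confirmed recipe.

```python
from urllib.parse import urlencode
from urllib.request import Request

AJAX_URL = "https://myspace.com/ajax/artistspage"
PAGE_URL = "https://myspace.com/discover/artists"


def build_artists_request(hash_value, genre_id=1002532, page=0):
    """Build (but do not send) a request that mimics the page's AJAX call.

    `hash_value` must be supplied by the caller; this sketch does not know
    how the site generates it.
    """
    query = urlencode(
        {"chartType": "heavyrotation", "genreId": genre_id, "page": page}
    )
    headers = {
        "Accept": "*/*",
        # Assumption: Hash is the header the server validates.
        "Hash": hash_value,
        # Make the request look as if it came from the discover page.
        "Referer": f"{PAGE_URL}?genreId={genre_id}",
        "X-Requested-With": "XMLHttpRequest",
    }
    return Request(f"{AJAX_URL}?{query}", headers=headers)


if __name__ == "__main__":
    # "PLACEHOLDER_HASH" is not a real value; replace it before sending.
    req = build_artists_request("PLACEHOLDER_HASH")
    print(req.full_url)
    print(req.header_items())
```

To actually send it, pass the returned object to urllib.request.urlopen. A 401 response will surface as urllib.error.HTTPError, which is a quick way to tell whether the chosen headers were enough.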