Scrapy Post Data

时间:2016-09-27 01:45:36

标签: python curl scrapy

我从python请求转到scrapy,我想发一个帖子请求,点击Instagram主题标签页底部的按钮。

cURL就是这个

curl "https://www.instagram.com/query/" -H "cookie: mid=VwBJIwAEAAGiVNY3epWm9pRgD9Ge; fbm_124024574287414=base_domain=.instagram.com; ig_pr=1; ig_vw=956; s_network=; fbsr_124024574287414=5HQEzU7XMqOLO4KeQMmSvyBcKsH2svemV1-nWIE4_iM.eyJhbGdvcml0aG0iOiJITUFDLVNIQTI1NiIsImNvZGUiOiJBUUQ0TnNLMjVCZmdvUFN4TjdfODNQaW81Z3U4MTNaZmZWVlNCcEdJNUdRWlczdmdfNGVXNXJyck5Sc3NXRFlSWjZiZEpWMU95V3hNUUcwSE9qMHItYlRiYk40VXpNZG5aLUJ5Zzk0VWZNSW1RZTd4R1JzTS1yaXRabmc0Z3FYNkpwbnF4b0VXajRPNEVGSDVoTXBCUFNHUGNHN0RHQ01uSjFLeXh1dllOc2cyaFpnSDFheVI0RUhMbE1nZGM4emVrNm9DXzdLa2s1TUoyYzhyYmEwWXo1VkI1bVVmS3NvLS11dXVxdjJlRmxFUHpYczVNQ3E1bW5BRk5IeWxxMG9veENQcXcwWUVLSnpsNnZSUzFReGUzQWpsQzFPU0cySU1QM0wwMGhUcnRraFF4ZEFhZElVMUtNNUw5VTRab2dlbjltdUFadkJjV0U3UUMxeTdibDRyTzhwWCIsImlzc3VlZF9hdCI6MTQ3NDkzODQ3MywidXNlcl9pZCI6IjEzNzc3ODgzNjkifQ; csrftoken=th33gPnvrsNS74reomY69ETfojX2avQ7" -H "origin: https://www.instagram.com" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.8" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36" -H "x-requested-with: XMLHttpRequest" -H "x-csrftoken: th33gPnvrsNS74reomY69ETfojX2avQ7" -H "x-instagram-ajax: 1" -H "content-type: application/x-www-form-urlencoded" -H "accept: */*" -H "referer: https://www.instagram.com/explore/tags/love/" -H "authority: www.instagram.com" --data "q=ig_hashtag(love)+"%"7B+media.after(J0HV-nGYwAAAF0HV-nGXAAAAFjgA"%"2C+10)+"%"7B"%"0A++count"%"2C"%"0A++nodes+"%"7B"%"0A++++caption"%"2C"%"0A++++code"%"2C"%"0A++++comments+"%"7B"%"0A++++++count"%"0A++++"%"7D"%"2C"%"0A++++comments_disabled"%"2C"%"0A++++date"%"2C"%"0A++++dimensions+"%"7B"%"0A++++++height"%"2C"%"0A++++++width"%"0A++++"%"7D"%"2C"%"0A++++display_src"%"2C"%"0A++++id"%"2C"%"0A++++is_video"%"2C"%"0A++++likes+"%"7B"%"0A++++++count"%"0A++++"%"7D"%"2C"%"0A++++owner+"%"7B"%"0A++++++id"%"0A++++"%"7D"%"2C"%"0A++++thumbnail_src"%"2C"%"0A++++video_views"%"0A++"%"7D"%"2C"%"0A++page_info"%"0A"%"7D"%"0A+"%"7D&ref=tags"%"3A"%"3Ashow" --compressed

因此,对于表单数据,我尝试了两件事:

body = response.xpath("//body")
html = str(body.extract())
end_cursor = re.search(r"\"end\_cursor\"\: \"(.+?)\"", html).group(1)

data = "q=ig_hashtag({})+%7B+media.after({}+10)+%7B%0A++count%2C%0A++nodes+%7B%0A++++caption%2C%0A++++code%2C%0A++++comments+%7B%0A++++++count%0A++++%7D%2C%0A++++comments_disabled%2C%0A++++date%2C%0A++++dimensions+%7B%0A++++++height%2C%0A++++++width%0A++++%7D%2C%0A++++display_src%2C%0A++++id%2C%0A++++is_video%2C%0A++++likes+%7B%0A++++++count%0A++++%7D%2C%0A++++owner+%7B%0A++++++id%0A++++%7D%2C%0A++++thumbnail_src%2C%0A++++video_views%0A++%7D%2C%0A++page_info%0A%7D%0A+%7D&ref=tags%3A%3Ashow".format(tag, end_cursor)
url = 'https://www.instagram.com/query/'

yield Request(url, body=data, method="POST", callback=self.parseHashtag)

和这个

data = {"q" :"ig_hashtag({})+%7B+media.after({}+10)+%7B%0A++count%2C%0A++nodes+%7B%0A++++caption%2C%0A++++code%2C%0A++++comments+%7B%0A++++++count%0A++++%7D%2C%0A++++comments_disabled%2C%0A++++date%2C%0A++++dimensions+%7B%0A++++++height%2C%0A++++++width%0A++++%7D%2C%0A++++display_src%2C%0A++++id%2C%0A++++is_video%2C%0A++++likes+%7B%0A++++++count%0A++++%7D%2C%0A++++owner+%7B%0A++++++id%0A++++%7D%2C%0A++++thumbnail_src%2C%0A++++video_views%0A++%7D%2C%0A++page_info%0A%7D%0A+%7D&ref=tags%3A%3Ashow".format(tag, end_cursor)}
yield FormRequest(url, formdata=data, callback=self.parseHashtag)

我收到403错误,所以我显然是错误地发送数据,我是不正确地格式化数据还是错误地调用了帖子?这是我的两个想法,但我很不确定。非常感谢任何帮助,谢谢。

网址是这个 - https://www.instagram.com/explore/tags/love/

这是我的git,https://github.com/Fuledbyramen/instagram_crawler/blob/master/instagram/spiders/instagram_spider.py

1 个答案:

答案 0 :(得分:1)

您似乎缺少正确的标题或任何标题。

您应该提供您在网络检查器中看到的每个标头,除了scrapy管理并自行填充的Cookie。

您可以轻松地从curl字符串网络中提取标题,检查通过以下方式提供:

foo = '''-H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.8" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36" -H "x-requested-with: XMLHttpRequest" -H "x-csrftoken: th33gPnvrsNS74reomY69ETfojX2avQ7" -H "x-instagram-ajax: 1" -H "content-type: application/x-www-form-urlencoded" -H "accept: */*" -H "referer: https://www.instagram.com/explore/tags/love/" -H "authority: www.instagram.com"'''
headers = [s.strip(' "').split(': ') for s in foo.split('-H')]
headers = [h for h in headers if any(h)]
headers = {k: v for k,v in headers}

你会得到:

 {'accept': '*/*',
 'accept-encoding': 'gzip, deflate, br',
 'accept-language': 'en-US,en;q=0.8',
 'authority': 'www.instagram.com',
 'content-type': 'application/x-www-form-urlencoded',
 'referer': 'https://www.instagram.com/explore/tags/love/',
 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
 'x-csrftoken': 'th33gPnvrsNS74reomY69ETfojX2avQ7',
 'x-instagram-ajax': '1',
 'x-requested-with': 'XMLHttpRequest'}

其中一些完全没有必要,例如referer主要用于分析,接受语言,接受和接受编码很可能被忽略。用户代理也由scrapy管理。

所以你剩下的是x-crsftoken,它可能什么也不做,但通常那些隐藏在html源代码中的某个地方; x-instagram-ajax似乎是一个表示ajax请求的静态标头; x-requested-with显示请求类型,主要是为了防止中间人攻击,你应该拥有它,因为它是指示请求类型以避免被阻止。

编辑: 我已经尝试过这个网站,你可以通过body作为url参数执行GET请求。只需在网络检查中右键单击请求,然后点击copy location with parameters,这将自动在url参数中转换来自正文的类似dict的数据。

即。 https://www.instagram.com/query/?q=ig_hashtag(scrapy)%20%7B%20media.after(J0HV-vvswAAAF0HV-Qp7AAAAFiYA%2C%2016)%20%7B%0A%20%20count%2C%0A%20%20nodes%20%7B%0A%20%20%20%20caption%2C%0A%20%20%20%20code%2C%0A%20%20%20%20comments%20%7B%0A%20%20%20%20%20%20count%0A%20%20%20%20%7D%2C%0A%20%20%20%20comments_disabled%2C%0A%20%20%20%20date%2C%0A%20%20%20%20dimensions%20%7B%0A%20%20%20%20%20%20height%2C%0A%20%20%20%20%20%20width%0A%20%20%20%20%7D%2C%0A%20%20%20%20display_src%2C%0A%20%20%20%20id%2C%0A%20%20%20%20is_video%2C%0A%20%20%20%20likes%20%7B%0A%20%20%20%20%20%20count%0A%20%20%20%20%7D%2C%0A%20%20%20%20owner%20%7B%0A%20%20%20%20%20%20id%0A%20%20%20%20%7D%2C%0A%20%20%20%20thumbnail_src%2C%0A%20%20%20%20video_views%0A%20%20%7D%2C%0A%20%20page_info%0A%7D%0A%20%7D&ref=tags%3A%3Ashow