I am trying to scrape some data from http://www.pogdesign.co.uk/cat/.
I want to get the channel and airtime for each show, but by default they are not displayed. They only appear after you manually configure the settings and save them.
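As far as I can tell from the Network tab in Chrome's developer tools, clicking "Save Settings" sends a POST request with the relevant form data (for example 's_networks': 'on', and so on), followed by a GET request that retrieves the HTML page with the channels and airtimes shown.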
I tried to reproduce this sequence (a POST request followed by a GET request) with both Python's requests package and the mechanicalsoup package.
requests:
import requests

s = requests.Session()
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res = s.get('http://www.pogdesign.co.uk/cat/')
mechanicalsoup:
import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')
However, the responses I get back do not contain the channel and airtime data.
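The only difference I noticed is that the POST request sent by the browser returns status code 302, while the one sent from my Python code returns status code 200.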
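A note on the 302 versus 200 difference: requests follows redirects by default, so a 302 from the server would normally end up in `response.history` while the final response reports 200. The sketch below is one way to check this, assuming the site answers the POST with a redirect as the browser trace suggests. The helper names `status_chain` and `post_without_following` are made up for illustration and are not part of requests.

```python
import requests

URL = "http://www.pogdesign.co.uk/cat/"


def status_chain(response):
    """Return the status code of every hop: redirects first, final response last."""
    return [r.status_code for r in response.history] + [response.status_code]


def post_without_following(session, url, data):
    """Send the POST but stop at the first response, so a redirect stays visible."""
    return session.post(url, data=data, allow_redirects=False)


if __name__ == "__main__":
    with requests.Session() as s:
        # Default behaviour: redirects are followed and recorded in .history.
        followed = s.post(URL, data={"s_networks": "on"})
        print(status_chain(followed))

        # Same request without following redirects, to inspect the first hop.
        raw = post_without_following(s, URL, {"s_networks": "on"})
        print(raw.status_code, raw.headers.get("Location"))
```

If the first list already contains a 302, the status code difference is only a matter of how the two clients report the redirect, and the missing data has another cause.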
Answer 0 (score: 3)
Because the user information is stored in cookies, you can try the following code:
import requests
s = requests.Session()
data = {
"style": 3,
"timezone": "GMT",
"s_numbers": "on",
"s_epnames": "on",
"s_airtimes": "on",
"s_popups": "on",
"s_wunwatched": "on",
"s_sortbyname": "on",
"s_weekstyle": "on",
"s_24hr": "on",
"settings": None  # note: requests leaves keys whose value is None out of the form body
}
cookies = {  # copy the cookie values from the browser's dev tools
"CAT_UID": "",
"PHPSESSID": "",
"_ga": "",
"_gid": "",
"_gat": ""
}
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text
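If typing each cookie by hand is tedious, you could copy the whole `Cookie` request header from the dev tools and convert it into a dict. This is a minimal sketch using the standard library's `http.cookies.SimpleCookie`. The function name `cookies_from_header` and the sample values are placeholders invented for illustration, not real values for this site.

```python
from http.cookies import SimpleCookie


def cookies_from_header(header_value):
    """Turn a raw 'Cookie:' header value into a plain dict for cookies=..."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}


# Placeholder string; paste the header value copied from the browser instead.
copied = "CAT_UID=abc123; PHPSESSID=def456"
cookies = cookies_from_header(copied)
print(cookies)
```

The resulting dict can be passed as the `cookies` argument of `s.post` and `s.get` in place of the hand-written one.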