Scraping data from pogdesign.co.uk/cat/

Time: 2017-07-31 11:08:06

Tags: python web-scraping beautifulsoup python-requests mechanicalsoup

I am trying to scrape some data from http://www.pogdesign.co.uk/cat/

I want to get the channel and air time of each show, but by default they are not displayed. They only appear after you configure the settings manually and save them.

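As far as I can tell from the "Network" tab of the Chrome developer tools, clicking "Save settings" sends a POST request with the relevant form parameters (for example 's_networks': 'on'), followed by a GET request that retrieves the HTML page with the channels and air times displayed.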

I tried to simulate this process (a POST request followed by a GET request) with both Python's requests package and the mechanicalsoup package.

requests:

import requests

s = requests.Session()
s.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
s.get('http://www.pogdesign.co.uk/cat/')

mechanicalsoup:

import mechanicalsoup

mcs = mechanicalsoup.Browser()
res_post = mcs.post('http://www.pogdesign.co.uk/cat/', data={'s_networks': 'on'})
res_get = mcs.get('http://www.pogdesign.co.uk/cat/')

However, the responses I get back do not contain the channel and air time data.

The only difference I noticed is that the POST request from the browser returns status code 302, whereas my Python request returns status code 200.
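
A note on the 302 versus 200 difference: requests follows redirects by default, so the status code on the returned response belongs to the final page, and any intermediate 302 is kept in `response.history`. The sketch below illustrates this against a small local server that is only a hypothetical stand-in for the real site (the handler class and the `compare_redirect_handling` helper are invented for this example); it does not show how pogdesign.co.uk itself responds.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RedirectAfterPost(BaseHTTPRequestHandler):
    """Hypothetical stand-in server: POST answers 302, GET answers 200."""

    def do_POST(self):
        # Read and discard the form body, then redirect back to the page.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(302)
        self.send_header("Location", "/cat/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = b"<html>calendar</html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def compare_redirect_handling():
    """Return (final status, statuses in history, status without following)."""
    server = HTTPServer(("127.0.0.1", 0), RedirectAfterPost)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/cat/" % server.server_port
    try:
        with requests.Session() as s:
            s.trust_env = False  # ignore proxy environment variables locally
            followed = s.post(url, data={"s_networks": "on"})
            raw = s.post(url, data={"s_networks": "on"}, allow_redirects=False)
        history = [r.status_code for r in followed.history]
        return followed.status_code, history, raw.status_code
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(compare_redirect_handling())
```

If the real site behaves like this stand-in, a 200 from Python and a 302 in the browser describe the same exchange, and the missing data would have another cause, such as the cookies sent with the request.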

1 answer:

Answer 0: (score: 3)

Because the user's information is stored in cookies, you can try the following code:

import requests

s = requests.Session()
data = {
    "style": 3,
    "timezone": "GMT",
    "s_numbers": "on",
    "s_epnames": "on",
    "s_airtimes": "on",
    "s_popups": "on",
    "s_wunwatched": "on",
    "s_sortbyname": "on",
    "s_weekstyle": "on",
    "s_24hr": "on",
    "settings": None  # note: requests leaves keys whose value is None out of the form body
}
cookies = {  # copy these values from the browser's developer tools
    "CAT_UID": "",
    "PHPSESSID": "",
    "_ga": "",
    "_gid": "",
    "_gat": ""
}
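# Submit the settings form, then reload the calendar page with the same cookies
post = s.post('http://www.pogdesign.co.uk/cat/', data=data, cookies=cookies)
text = post.text
get = s.get('http://www.pogdesign.co.uk/cat/', cookies=cookies)
text1 = get.text  # HTML that should now include the channels and air times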
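
As a variation on the code above, the cookies can be stored on the session once instead of being passed to every call. This is a minimal sketch, not a tested recipe for this site: the helper names `build_session` and `cookie_header_for` and the placeholder cookie values are assumptions made for illustration. It prepares a request without sending it, so you can check which Cookie header would go out.

```python
import requests

URL = "http://www.pogdesign.co.uk/cat/"


def build_session(cookie_values):
    """Return a Session that sends the given cookies with every request."""
    s = requests.Session()
    # Cookies stored on the session are attached to each later request,
    # so cookies=... does not have to be repeated per call.
    s.cookies.update(cookie_values)
    return s


def cookie_header_for(session, url=URL):
    """Prepare (but do not send) a GET request and return its Cookie header."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers.get("Cookie")


if __name__ == "__main__":
    # Placeholder values: copy the real ones from the browser's developer tools.
    session = build_session({
        "CAT_UID": "uid-placeholder",
        "PHPSESSID": "session-placeholder",
    })
    print(cookie_header_for(session))
```

With a session built this way, the POST and GET from the answer can be written as `session.post(URL, data=data)` and `session.get(URL)`.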