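I'm trying to scrape an AJAX-loaded part of a web page without executing JavaScript. Using Chrome DevTools, I found that the AJAX container pulls its content from a URL through a POST request, so I want to replicate that request with the python requests package. Strangely, when I use the header information provided by Chrome I always get a 400 error, and the same thing happens with the curl command copied from Chrome. So I'm wondering whether anyone can share some insight.
The website I'm interested in is here. In Chrome (Ctrl+Shift+I, Network, XHR), the part I want is "content". The script I'm using is: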
headers = {"authority": "cafe.bithumb.com",
"path": "/boards/43/contents",
"method": "POST",
"origin":"https://cafe.bithumb.com",
"accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
"accept-encoding":"gzip, deflate, br",
"content-type": "application/x-www-form-urlencoded; charset=UTF-8",
"accept":"application/json, text/javascript, */*; q=0.01",
"referer":"https://cafe.bithumb.com/view/boards/43",
"x-requested-with":"XMLHttpRequest",
"scheme": "https",
"content-length":"1107"}
s=requests.Session()
s.headers.update(headers)
r = s.post('https://cafe.bithumb.com/boards/43/contents')
Answer 0 (score: 0)
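You only need to compare the POST data of two requests to see that they are almost identical apart from a few parameters (draw=page...start=xx). This means you can scrape the Ajax data by changing draw and start.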
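To make the point about draw and start concrete, here is a minimal sketch that rebuilds the form fields for a given page and checks which ones differ between two pages. The helper name `build_payload` and its parameters `page_size` and `n_columns` are hypothetical; the field names and values are taken from the POST body shown in this answer, and the assumption that every page uses the same 30-row layout is mine.

```python
def build_payload(page, page_size=30, n_columns=5):
    """Build the form fields for one page of results (hypothetical helper)."""
    payload = {"draw": page}
    for i in range(n_columns):
        prefix = f"columns[{i}]"
        payload[f"{prefix}[data]"] = i
        payload[f"{prefix}[name]"] = ""
        payload[f"{prefix}[searchable]"] = "true"
        payload[f"{prefix}[orderable]"] = "false"
        payload[f"{prefix}[search][value]"] = ""
        payload[f"{prefix}[search][regex]"] = "false"
    payload["start"] = page_size * (page - 1)  # offset of the first row on this page
    payload["length"] = page_size
    payload["search[value]"] = ""
    payload["search[regex]"] = "false"
    return payload


first, second = build_payload(1), build_payload(2)
# Collect the keys whose values differ between page 1 and page 2.
changed = sorted(k for k in first if first[k] != second[k])
print(changed)  # ['draw', 'start']
```

A dict built this way can be passed as the `data` argument of `requests.post`, which form-encodes it.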
Edit: the data is now converted to a dictionary, so we don't need urlencode, and we don't need cookies either (I've tested this).
import requests
import json
headers = {
"Accept": "application/json, text/javascript, */*; q=0.01",
"Origin": "https://cafe.bithumb.com",
"X-Requested-With": "XMLHttpRequest",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
"DNT": "1",
"Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
"Referer": "https://cafe.bithumb.com/view/boards/43",
"Accept-Encoding": "gzip, deflate, br"
}
string = """columns[0][data]=0&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=1&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=false&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=2&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=false&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=3&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=false&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=4&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=false&columns[4][search][value]=&columns[4][search][regex]=false&start=30&length=30&search[value]=&search[regex]=false"""
article_root = "https://cafe.bithumb.com/view/board-contents/{}"
for page in range(1, 4):
    with requests.Session() as s:
        s.headers.update(headers)
        data = {"draw": page}
        # split each "key=value" pair at the first "=" to build the form fields
        data.update({ele[:ele.find("=")]: ele[ele.find("=") + 1:] for ele in string.split("&")})
        data["start"] = 30 * (page - 1)  # row offset for this page
        r = s.post('https://cafe.bithumb.com/boards/43/contents', data=data, verify=False)  # set verify=False only while you are using Fiddler
        json_data = json.loads(r.text).get("data")  # parse the JSON string into a dict so the rows are easier to extract
        for each in json_data:
            url = article_root.format(each[0])
            print(url)
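As an alternative to splitting the form string by hand, the standard library can do the same conversion. This is only a sketch: `form_string` is a shortened, illustrative version of the POST body above, not the full one, and the round trip through `urlencode` is there just to show what a form-encoded body looks like.

```python
from urllib.parse import parse_qsl, urlencode

# Shortened, illustrative version of the form body used above.
form_string = (
    "columns[0][data]=0&columns[0][name]=&start=30&length=30"
    "&search[value]=&search[regex]=false"
)

# keep_blank_values=True keeps fields such as "columns[0][name]=" that have no value.
data = dict(parse_qsl(form_string, keep_blank_values=True))
data["start"] = "0"  # override the offset, as the loop above does per page

# Encoding the dict again yields a valid form body with the brackets percent-encoded.
encoded = urlencode(data)
print(encoded)
```

Without `keep_blank_values=True`, `parse_qsl` drops the empty fields, so the rebuilt body would no longer match what the browser sent.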