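I wrote a scraper in Python to fetch the lot numbers from a web page. However, when I run it, the console shows "The requested URL is invalid". I printed the response URL and it looks valid. Am I doing something wrong with the request?
The script I am using: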
import requests
from lxml import html
payload = {"keyword":"degas"}
headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "User-Agent": "Mozilla/5.0"
}
response = requests.get("http://www.sothebys.com/en/search-results.html?", params=payload, headers=headers, allow_redirects=False)
# tree = html.fromstring(response.text)
# for item in tree.cssselect("div.search-results-lot-number"):
# print(item.text)
print(response.url)
print(response.text)
print(response.status_code)
This is what I get in the console when I print "response.url", "response.text" and "response.status_code":
http://www.sothebys.com/en/search-results.html?keyword=degas
<HTML><HEAD>
<TITLE>Invalid URL</TITLE>
</HEAD><BODY>
<H1>Invalid URL</H1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.541d2017.1503578560.40be2bd
</BODY></HTML>
400
Incidentally, when I open the URL manually in a browser, it does take me to the expected page.
Answer 0 (score: 1)
I think you are sending the wrong headers. The following headers worked for me:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36'}
Output:
<!DOCTYPE html>
<!--[if lt IE 7]> <html xml:lang="en" lang="en" class="no-js pre-ie9"> <![endif]-->
<!--[if IE 7]> <html xml:lang="en" lang="en" class="no-js ie7 pre-ie9"> <![endif]-->
<!--[if IE 8]> <html xml:lang="en" lang="en" class="no-js ie8 pre-ie9"> <![endif]-->
<!--[if IE 9]> <html xml:lang="en" lang="en" class="no-js ie9"> <![endif]-->
<!--[if gt IE 9]><!--> <html xml:lang="en" lang="en" class="no-js"> <!--<![endif]-->
<head>
<!--GLOBAL META-->
<!-- requestUrl=/content/sothebys/en/search-results.html?keyword=degas -->
<title>Search Results | Sotheby's</title>
<meta name="description" content="View auction details, art exhibitions and online catalogues; bid, buy and collect contemporary, impressionist or modern art, old masters, jewellery, wine, watches, prints, rugs and books at sotheby's auction house">
<meta name="keywords" content="auction, art, exhibition, online, catalogue, bid, buy, collect, contemporary, impressionist, modern, old mast...
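To see exactly what changed between the failing request and the one in this answer, it can help to diff the two header sets. The following is only a minimal sketch: `diff_headers` is a hypothetical helper, not part of `requests`, and the header values are copied from the question and from this answer. It does not prove which difference the server objects to; it only narrows down what to test next.

```python
def diff_headers(failing, working):
    """Compare two header dicts, ignoring case in header names."""
    failing_norm = {name.lower(): value for name, value in failing.items()}
    working_norm = {name.lower(): value for name, value in working.items()}
    shared = set(failing_norm) & set(working_norm)
    return {
        # Headers sent only by the failing request
        "only_in_failing": sorted(set(failing_norm) - set(working_norm)),
        # Headers sent only by the working request
        "only_in_working": sorted(set(working_norm) - set(failing_norm)),
        # Headers present in both but with different values
        "changed": sorted(n for n in shared if failing_norm[n] != working_norm[n]),
    }


# Headers from the question (returned 400 "Invalid URL")
question_headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "User-Agent": "Mozilla/5.0",
}

# Headers from this answer (returned the search results page)
answer_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Safari/537.36",
}

if __name__ == "__main__":
    print(diff_headers(question_headers, answer_headers))
```

Two things differ: the working request sends no Content-Type header, and its User-Agent is a full browser string. Retrying the original script with only one of those changes at a time should show which one matters.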
Answer 1 (score: 0)
I got it working as follows.
import requests
payload = {
    'keyword': 'degas',
    'pageSize': '24',
    'offset': '0'
}
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Referer': 'http://www.sothebys.com/en/search-results.html?keyword=degas',
    'User-Agent': 'Mozilla/5.0'
}
response = requests.get("http://www.sothebys.com/en/search", params=payload, headers=headers)
print(response.url)
print(response.status_code)
print(response.text)
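The `pageSize` and `offset` parameters suggest the endpoint is paginated. Below is a rough sketch of how the parameters for later pages might be built. `page_params` and `page_url` are hypothetical helpers, and the sketch assumes `offset` counts items rather than pages, which the answer above does not confirm.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer above
SEARCH_URL = "http://www.sothebys.com/en/search"


def page_params(keyword, page, page_size=24):
    """Build the query parameters for a zero-based page number.

    Assumption: offset is an item offset, so page N starts at N * page_size.
    """
    return {
        "keyword": keyword,
        "pageSize": str(page_size),
        "offset": str(page * page_size),
    }


def page_url(keyword, page, page_size=24):
    """Return the full URL for a page, useful for checking it by hand."""
    return SEARCH_URL + "?" + urlencode(page_params(keyword, page, page_size))


if __name__ == "__main__":
    for page in range(3):
        print(page_url("degas", page))
```

The dict returned by `page_params` can be passed as `params=` to `requests.get` in place of the fixed `payload` above. If later pages repeat or skip results, the offset assumption is probably wrong.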