Question

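I've written a scraper in Python to fetch the different lot numbers from a web page. However, when I run it, I see "The requested URL is invalid" in the console. I printed the response URL and found that it is valid. Is there something wrong with the way I'm making the request? Here is the script I'm trying: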

import requests
from lxml import html

payload = {"keyword":"degas"}

headers={
"Content-Type":"text/html; charset=UTF-8",
"User-Agent":"Mozilla/5.0"
}

response = requests.get("http://www.sothebys.com/en/search-results.html?", params=payload, headers=headers, allow_redirects=False)
# tree = html.fromstring(response.text)
# for item in tree.cssselect("div.search-results-lot-number"):
#     print(item.text)

print(response.url)
print(response.text)
print(response.status_code)

This is what I get in the console when I print "response.url", "response.text" and "response.status_code":

http://www.sothebys.com/en/search-results.html?keyword=degas
<HTML><HEAD>
<TITLE>Invalid URL</TITLE>
</HEAD><BODY>
<H1>Invalid URL</H1>
The requested URL "&#91;no&#32;URL&#93;", is invalid.<p>
Reference&#32;&#35;9&#46;541d2017&#46;1503578560&#46;40be2bd
</BODY></HTML>

400

By the way, if I open the URL manually in a browser, it does take me to the expected page.
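
The error body above is written with numeric character references, which makes it hard to read. A minimal sketch, using only the standard library, decodes the two text lines copied from that output. The decoded message shows the server reporting "[no URL]" rather than echoing the requested path; reading that as a sign that the request was rejected before the URL was evaluated is an assumption, not something the output proves.

```python
import html

# The two text lines of the error page, copied from the console output above.
error_body = (
    'The requested URL "&#91;no&#32;URL&#93;", is invalid.<p>\n'
    "Reference&#32;&#35;9&#46;541d2017&#46;1503578560&#46;40be2bd"
)

# html.unescape turns numeric character references back into plain characters.
decoded = html.unescape(error_body)
print(decoded)
```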

Answer 1

I think you are using the wrong headers. The following header worked for me:

headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36'}

Output:

<!DOCTYPE html>
<!--[if lt IE 7]>      <html xml:lang="en" lang="en" class="no-js pre-ie9"> <![endif]-->
<!--[if IE 7]>         <html xml:lang="en" lang="en" class="no-js ie7 pre-ie9"> <![endif]-->
<!--[if IE 8]>         <html xml:lang="en" lang="en" class="no-js ie8 pre-ie9"> <![endif]-->
<!--[if IE 9]>         <html xml:lang="en" lang="en" class="no-js ie9"> <![endif]-->
<!--[if gt IE 9]><!--> <html xml:lang="en" lang="en" class="no-js"> <!--<![endif]-->

<head>
    <!--GLOBAL META-->
<!-- requestUrl=/content/sothebys/en/search-results.html?keyword=degas -->
<title>Search Results | Sotheby's</title>
<meta name="description" content="View auction details, art exhibitions and online catalogues; bid, buy and collect contemporary, impressionist or modern art, old masters, jewellery, wine, watches, prints, rugs and books at sotheby's auction house">
<meta name="keywords" content="auction, art, exhibition, online, catalogue, bid, buy, collect, contemporary, impressionist, modern, old mast...

...
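
For reference, here is a sketch of how the header above might be combined with the request from the question. It omits the "Content-Type" header that the question sent. Whether the rejection was caused by that header, by the short User-Agent, or by both is not established in this thread, so treat the omission as an assumption. The helper names `build_search_request` and `fetch` are made up for illustration. Preparing the request without sending it lets you inspect the final URL and headers offline.

```python
import requests

SEARCH_URL = "http://www.sothebys.com/en/search-results.html"

# User-Agent string quoted verbatim from the answer above. The Content-Type
# header used in the question is left out on purpose (assumption: a GET
# request with no body does not need it).
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Cafari/537.36"
    )
}


def build_search_request(session, keyword):
    """Prepare (but do not send) the search request so it can be inspected."""
    request = requests.Request(
        "GET", SEARCH_URL, params={"keyword": keyword}, headers=HEADERS
    )
    # prepare_request merges the session's default headers with ours.
    return session.prepare_request(request)


def fetch(session, keyword):
    """Send the prepared request; this needs network access."""
    return session.send(build_search_request(session, keyword), timeout=10)


session = requests.Session()
prepared = build_search_request(session, "degas")
print(prepared.url)
print(dict(prepared.headers))
```

Calling `fetch(session, "degas")` would perform the actual request; the site may behave differently today than it did when the answer was written.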

Answer 2

I got it working in the following way:

import requests

payload = {
    'keyword':'degas',
    'pageSize':'24',
    'offset':'0'
    }

headers={
    'Accept':'application/json, text/javascript, */*; q=0.01',
    'Referer':'http://www.sothebys.com/en/search-results.html?keyword=degas',
    "User-Agent":"Mozilla/5.0"
    }

response = requests.get("http://www.sothebys.com/en/search", params=payload, headers=headers)

print(response.url)
print(response.status_code)
print(response.text)
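
The payload above carries "pageSize" and "offset" alongside the keyword. A small sketch of how later pages might be requested follows. It assumes that "offset" is a zero-based item offset and that advancing it by "pageSize" yields the next page; neither is confirmed in this thread. The helper name `search_params` is hypothetical.

```python
def search_params(keyword, page, page_size=24):
    """Build the query parameters for a zero-based page number.

    Assumption: the endpoint treats 'offset' as an item offset, so page N
    starts at N * page_size. This is not confirmed by the thread.
    """
    if page < 0:
        raise ValueError("page must be zero or greater")
    return {
        "keyword": keyword,
        "pageSize": str(page_size),
        "offset": str(page * page_size),
    }


# Page 0 reproduces the payload used in the answer above.
first_page = search_params("degas", 0)
third_page = search_params("degas", 2)
print(first_page)
print(third_page)
```

The resulting dictionary can be passed as `params=` in place of the hard-coded payload.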

Scraper throws invalid URL error
