I am trying to parse multiple pages of a website, but I do not understand how to change the query part of the URL (if that makes sense).
I tried to build a next_page value that takes the first page and adds 1 each time the next-page element is found, but I don't think that will work because I will have several start URLs (all similar). When I extract the next-page element, it returns the following:
["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]
Using urlparse(response.url).query I get:
'networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword='
All I want is to build a new link that uses the same scheme and path but a different query.
If you need more information, let me know. I am still a beginner, so I do not really know what is most relevant to you.
>>> from urllib.parse import urlparse, urljoin
>>> urlparse(response.url)
ParseResult(scheme='https', netloc='www.wcaworld.com', path='/Directory', params='', query='networkId=24&pageNumber=1&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=', fragment='')
>>> response.css('a.loadmore::attr(onmouseover)').extract()
["loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'); return false;"]
Answer 0 (score: 0)
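You need the base URL for that <a> element, that is, the part of the URL before the query string begins. For https://example.com/a/path/?query=param the base URL is https://example.com/a/path/. Save it in a variable. Then parse the query string with urllib.parse.parse_qsl, update the page number, and join the result back onto the base URL.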
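from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit

BASE_URL = 'https://example.com/a/path/'
# you can also derive the base URL from a scrapy Response object:
# BASE_URL = urlsplit(response.url)._replace(query='', fragment='').geturl()

if __name__ == '__main__':
    # query string extracted from the "load more" link
    q = 'networkId=24&pageNumber=2&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490'
    # keep a list of (key, value) pairs instead of a dict: a dict would collapse
    # the repeated networkIds/licenseIds keys to a single entry each;
    # keep_blank_values=True preserves the empty city= and keyword= parameters
    pairs = parse_qsl(q, keep_blank_values=True)
    # increment pageNumber and leave every other pair untouched
    pairs = [(k, str(int(v) + 1)) if k == 'pageNumber' else (k, v) for k, v in pairs]
    next_page_url = urljoin(BASE_URL, '?' + urlencode(pairs))
    print(next_page_url)
Output:
https://example.com/a/path/?networkId=24&pageNumber=3&pageSize=100&allnet=yes&networkIds=1&networkIds=2&networkIds=3&networkIds=4&networkIds=61&networkIds=98&networkIds=108&networkIds=6&networkIds=5&networkIds=22&networkIds=13&networkIds=18&networkIds=15&networkIds=16&networkIds=105&networkIds=38&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&licenseIds=0&searchby=CountryCode&orderby=CountryCity&country=ES&city=&keyword=&lastCid=116490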
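Since the onmouseover value in the question already carries the complete query for the next page (pageNumber=2 plus lastCid), another option may be to reuse that query as-is instead of incrementing anything. The sketch below assumes the attribute always has the form loadmoreresult('?...'); the names LOADMORE_RE and next_page_from_onmouseover are made up for illustration, and the shortened URLs in the demo are placeholders rather than real requests.

```python
import re
from urllib.parse import urljoin

# Assumed attribute shape: loadmoreresult('?key=value&...'); return false;
LOADMORE_RE = re.compile(r"loadmoreresult\('(\?[^']*)'\)")


def next_page_from_onmouseover(page_url, attr_value):
    """Return the absolute next-page URL, or None if the attribute does not match."""
    match = LOADMORE_RE.search(attr_value or '')
    if match is None:
        return None
    # A reference that consists only of a query keeps the scheme, host and
    # path of page_url and replaces its query.
    return urljoin(page_url, match.group(1))


if __name__ == '__main__':
    # Shortened, illustrative inputs modelled on the question's data.
    page_url = 'https://www.wcaworld.com/Directory?networkId=24&pageNumber=1&pageSize=100'
    attr = "loadmoreresult('?networkId=24&pageNumber=2&pageSize=100&lastCid=116490'); return false;"
    print(next_page_from_onmouseover(page_url, attr))
```

In a Scrapy callback, page_url would be response.url and attr_value the first result of the a.loadmore::attr(onmouseover) selector. Because the link is resolved against whichever page produced it, the same callback should work for several similar start URLs.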