I've created a script in Python to grab only the part of a webpage that shows how many results it contains. When I try the link that the script ends up using, I get something like Showing 1-30 of 18893
(not the result I'm after), but when I try the link further down, I get Showing 1-30 of 196
(the expected output). Bottom line: I get the right result with the direct link, but when the script builds the URL from params I get something else.
What I've tried:
import requests
from bs4 import BeautifulSoup

link = "https://www.yelp.com/search?"
params = {
    'find_desc': 'Restaurants',
    'find_loc': 'New York, NY',
    'l: p': 'NY:New_York:Manhattan:Alphabet_City'
}

resp = requests.get(link, params=params)
soup = BeautifulSoup(resp.text, "lxml")
total = soup.select_one("p:contains(Showing)").text
print(total)
Getting:
Showing 1-30 of 18894
Expected output:
Showing 1-30 of 196
Also, the link I get from resp.url is:
https://www.yelp.com/search?find_desc=Restaurants&find_loc=New+York%2C+NY&l%3A+p=NY%3ANew_York%3AManhattan%3AAlphabet_City
But the link I expect is:
https://www.yelp.com/search?find_desc=Restaurants&find_loc=New%20York%2C%20NY&l=p%3ANY%3ANew_York%3AManhattan%3AAlphabet_City
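The difference becomes visible if the params dict is percent-encoded by hand; requests applies essentially the same rules as urllib.parse.urlencode, so the whole 'l: p' key ends up encoded inside the parameter name (a quick diagnostic sketch):

from urllib.parse import urlencode

# Encoding the dict from the script: 'l: p' is treated as the parameter
# name, so its colon and space get percent-encoded into the name itself.
print(urlencode({
    'find_desc': 'Restaurants',
    'find_loc': 'New York, NY',
    'l: p': 'NY:New_York:Manhattan:Alphabet_City'
}))
# find_desc=Restaurants&find_loc=New+York%2C+NY&l%3A+p=NY%3ANew_York%3AManhattan%3AAlphabet_City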
How can I make the script build the correct URL so it fetches the right content?
Answer (score: 1)
You have a typo in the 'l: p': 'NY:New_York:Manhattan:Alphabet_City'
parameter: the key should be just 'l', and the value should start with 'p:'.
Rather than trying to work out the encoding by hand, it's a good idea to decode the query string of the working URL with urllib.parse.parse_qs
and copy the parameters from there.
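For instance, decoding the query string of the known-good link lays the parameters out plainly (a small sketch using the expected URL from the question):

from urllib.parse import urlparse, parse_qs

expected = ("https://www.yelp.com/search?find_desc=Restaurants"
            "&find_loc=New%20York%2C%20NY"
            "&l=p%3ANY%3ANew_York%3AManhattan%3AAlphabet_City")

# parse_qs undoes the percent-encoding and groups values by parameter name
print(parse_qs(urlparse(expected).query))
# {'find_desc': ['Restaurants'],
#  'find_loc': ['New York, NY'],
#  'l': ['p:NY:New_York:Manhattan:Alphabet_City']}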
Here's the fixed version:
import requests
from bs4 import BeautifulSoup

link = "https://www.yelp.com/search"
params = {
    'find_desc': 'Restaurants',
    'find_loc': 'New York, NY',
    'l': 'p:NY:New_York:Manhattan:Alphabet_City'  # key 'l', value starting with 'p:'
}

res = requests.get(link, params=params)
soup = BeautifulSoup(res.text, 'html.parser')
print(res.url)  # confirm the generated URL matches the expected one

total = soup.select_one("p:contains(Showing)").text
print(total)
Output:
https://www.yelp.com/search?find_desc=Restaurants&find_loc=New+York%2C+NY&l=p%3ANY%3ANew_York%3AManhattan%3AAlphabet_City
Showing 1-30 of 196
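One caveat: newer versions of soupsieve (the CSS selector engine behind BeautifulSoup's select_one) deprecate the :contains() pseudo-class, so the selector may need to be written with the :-soup-contains() form instead:

# equivalent selector for recent soupsieve releases
total = soup.select_one("p:-soup-contains(Showing)").text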