I'm trying to scrape Yellow Pages using requests. I know you don't need to log in to get the data on these pages, but I wanted to try logging in to the site anyway.
Is there a way to grab multiple URLs at once with s.get()? This is how my code is currently laid out, but it seems like there should be a simpler way, so that I don't have to write another five lines of code every time I want to add a new page.
This code works for me, but it seems too long.
import requests
from bs4 import BeautifulSoup
import requests.cookies
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
page = s.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()
csrf = soup.find("input", value=True)["value"]
USERNAME = 'myusername'
PASSWORD = 'mypassword'
cj = s.cookies
requests.utils.dict_from_cookiejar(cj)
login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"})
targeted_page = s.get('http://m.yp.com/search?search_term=restaurants&search_type=category', cookies=cj)
targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
targeted_soup.prettify()
for record in targeted_soup.findAll('div'):
    print(record.text)
targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', cookies=cj)
targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
targeted_soup_2.prettify()
for data in targeted_soup_2.findAll('div'):
    print(data.text)
When I try to use a dict like this, I get a traceback that I don't understand.
import requests
from bs4 import BeautifulSoup
import requests.cookies
s = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
url = "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_register&vrid=cc9cb936-50d8-493b-83c6-842ec2f068ed&register=true"
page = s.get(url, headers=headers)
soup = BeautifulSoup(page.content, "lxml")
soup.prettify()
csrf = soup.find("input", value=True)["value"]
USERNAME = 'myusername'
PASSWORD = 'mypassword'
login_data = dict(email=USERNAME, password=PASSWORD, _csrf=csrf)
s.post(url, data=login_data, headers={'Referer': "https://accounts.yellowpages.com/login?next=https%3A%2F%2Faccounts.yellowpages.com%2Fdialog%2Foauth&client_id=590d26ff-34f1-447e-ace1-97d075dd7421&response_type=code&app_id=WEB&source=ypu_login&vrid=63dbd394-afff-4794-aeb0-51dd19957ebc&merge_history=true"})
targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category',
'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA'
}
targeted_page = s.get(targeted_pages)
targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
targeted_soup.prettify()
for record in targeted_soup.findAll('div'):
    print(record.text)
targeted_page_2 = s.get('http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA')
targeted_soup_2 = BeautifulSoup(targeted_page_2.content, "lxml")
targeted_soup_2.prettify()
Error:
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for '{'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA', 'http://m.yp.com/search?search_term=restaurants&search_type=category'}'
I'm new to Python and the requests module, but I don't understand why using a dict in this format doesn't work. Thanks for any input.
Answer 0 (score: 0)
First of all, what you have is a set, not a dict. Either way, requests.get takes a single URL as its first argument, not a set or any other iterable, so you need to iterate over the URLs and request each one in turn:
targeted_pages = {'http://m.yp.com/search?search_term=restaurants&search_type=category',
'http://www.yellowpages.com/search?search_terms=Gas+Stations&geo_location_terms=Los+Angeles%2C+CA'
}
for target in targeted_pages:
    targeted_page = s.get(target)
    targeted_soup = BeautifulSoup(targeted_page.content, "lxml")
    for record in targeted_soup.findAll('div'):
        print(record.text)
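If the per-page work grows beyond printing div text, the loop can be wrapped in a small helper so that adding a page means adding one string to the collection rather than five new lines. This is only a sketch: scrape_divs is a made-up name, and it uses Python's built-in "html.parser" instead of "lxml" so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

def scrape_divs(session, urls, parser="html.parser"):
    """Fetch each URL with the shared (logged-in) session and
    map it to a list of the text of every div on that page."""
    results = {}
    for url in urls:
        page = session.get(url)
        soup = BeautifulSoup(page.content, parser)
        results[url] = [div.get_text() for div in soup.find_all("div")]
    return results
```

Called as scrape_divs(s, targeted_pages), it returns a dict keyed by URL, so the results for each page stay separate instead of being interleaved by print statements.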