New programmer here. While learning Python I'm trying to webscrape the POF website using the requests module and Beautiful Soup. Thanks in advance.

The error seems to come from res = requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' % pageId)

I tried dropping the pagination and scraping just one page, but that didn't work. I also tried a time.sleep of 3 seconds between each request, but that didn't work either.
#Username and password
username = 'MyUsername'
password = 'MyPassword'

#Login to pof site
from selenium import webdriver
import bs4, requests

browser = webdriver.Chrome(executable_path='/Users/Desktop/geckodriver-v0.24.0-win32/chromedriver.exe')
browser.get('https://www.pof.com')
linkElem = browser.find_element_by_link_text('Sign In')
linkElem.click()
usernameElem = browser.find_element_by_id('logincontrol_username')
usernameElem.send_keys(username)
passwordElem = browser.find_element_by_id('logincontrol_password')
passwordElem.send_keys(password)
passwordElem.submit()

#Webscraping online profile links from first 7 pagination pages
for pageId in range(7):
    res = requests.get('https://www.pof.com/everyoneonline.aspx?page_id=%s' % pageId)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    profile = soup.findAll('div', attrs={'class': 'rc'})
    for div in profile:
        print(div.find('a')['href'])  # findAll() returns a list, so take a single tag with find()
Expected results: print a list of all the profile href links, to be saved to a csv later.

Actual results:

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
Answer 0 (score: 0)
Some general advice for scraping web pages: have a look at the re module. BeautifulSoup is great, but for general use, re has been easier in my experience.
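As a minimal sketch of the re approach, the html string below stands in for page source you have already fetched; the pattern simply grabs every href attribute:

import re

# Stand-in for the page source you fetched while logged in.
html = '<div class="rc"><a href="/profile1.aspx">A</a></div>'

# Grab the value of every href attribute on the page.
links = re.findall(r'href="([^"]+)"', html)
print(links)  # ['/profile1.aspx']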
Now to your question. There are lots of different web pages, but this is how I'd suggest scraping any of them:
1. Open your browser's developer tools and go to the Network section. Here you can see all the requests your browser makes, along with their headers and sources.
2. Log in manually and find the request made with the GET or, in your case, the POST method.
3. Put that request's headers into a Python dictionary, ignoring the headers that start with ":" (for example ":method: POST"), like this:
headers = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9,fa-IR;q=0.8,fa;q=0.7,de;q=0.6",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "dnt": "1",
    "origin": "https://stackoverflow.com",
    "referer": "https://stackoverflow.com/questions/56399462/error-message-10054-when-wescraping-with-requests-module",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36",
}
Under the Headers section there should be another part, named "Payload" or "Form Data". Put its contents into a second Python dictionary and change them as needed. Now you can pass the extracted data to a Python request, and then use re or BeautifulSoup on the response content to extract the data you want.
In this example I log in to https://aavtrain.com/index.asp
Try to follow the steps described above and understand what is happening here:
import requests

username = "something"
password = "somethingelse"

headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding": "gzip, deflate, br",
    "cache-control": "max-age=0",
    "content-type": "application/x-www-form-urlencoded",
    "dnt": "1",
    "origin": "https://aavtrain.com",
    "referer": "https://aavtrain.com/index.asp",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}

data = {
    "user_name": username,
    "password": password,
    "Submit": "Submit",
    "login": "true"
}

with requests.Session() as session:
    session.get("https://aavtrain.com/index.asp")
    loggedIn = session.post("https://aavtrain.com/index.asp", headers=headers, data=data)
    #... do stuff after logged in..
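One note on the question's code: requests.get() runs outside the Selenium browser, so it carries none of the login cookies, which is a classic reason for a server to reset the connection. Here is a sketch of how the same Session approach could be adapted to the question's loop; the login endpoint, headers, and form-field names below are placeholders, not pof.com's real ones, and must be read from your own Network tab:

import bs4
import requests

# Placeholder header and form values -- copy the real ones from your
# browser's Network tab; these are NOT pof.com's actual field names.
headers = {"user-agent": "Mozilla/5.0"}
data = {"username": "MyUsername", "password": "MyPassword"}

with requests.Session() as session:
    # Hypothetical login endpoint, for illustration only.
    session.post("https://www.pof.com/login.aspx", headers=headers, data=data)
    # The session now attaches the login cookies to every later request.
    for pageId in range(7):
        res = session.get("https://www.pof.com/everyoneonline.aspx?page_id=%s" % pageId,
                          headers=headers)
        res.raise_for_status()
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        for div in soup.findAll("div", attrs={"class": "rc"}):
            print(div.find("a")["href"])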
I hope this helps; feel free to ask any lingering questions and I'll get back to you as soon as I can.